Abstract

Our understanding of information in systems has been based on the foundation of memoryless processes. 
Extensions to stable Markov and auto-regressive processes are classical. Berger proved a source coding theorem 
for the marginally unstable Wiener process, but the infinite-horizon exponentially unstable case has been open since 
Gray's 1970 paper. There were also no theorems showing what is needed to communicate such processes across 
noisy channels. 

In this work, we give a fixed-rate source-coding theorem for the infinite-horizon problem of coding an
exponentially unstable Markov process. The encoding naturally results in two distinct bitstreams that have qualitatively
different QoS requirements for communicating over a noisy medium. The first stream captures the information that 
is accumulating within the nonstationary process and requires sufficient anytime reliability from the channel used to 
communicate the process. The second stream captures the historical information that dissipates within the process 
and is essentially classical. This historical information can also be identified with a natural stable counterpart to 
the unstable process. A converse demonstrating the fundamentally layered nature of unstable sources is given by 
means of information-embedding ideas.

Source coding and channel requirements for unstable processes

I. Introduction 

The source and channel models studied in information theory are not just interesting in their own right, 
but also provide insights into the architecture of reliable communication systems. Since Shannon's work, 
memoryless sources and channels have always been at the base of our understanding. They have provided 
the key insight of separating source and channel coding with the bit rate alone appearing at the interface 
[1], [2]. The basic story has been extended to many different sources and channels with memory for 
point-to-point communication [3]. 

However, there are still many issues for which information theoretic understanding eludes us.
Networking in particular has a whole host of such issues, leading Ephremides and Hajek to entitle their
survey article "Information Theory and Communication Networks: An Unconsummated Union!" [4]. They 
comment: 

The interaction of source coding with network-induced delay cuts across the classical network layers and has 
to be better understood. The interplay between the distortion of the source output and the delay distortion induced 
on the queue that this source output feeds into may hold the secret of a deeper connection between information 
theory. Again, feedback and delay considerations are important. 

Real communication networks and networked applications are quite complicated. To move toward a
quantitative and qualitative understanding of the issues, tractable models that exhibit at least some of
the right qualitative behavior are essential. In [5], [6], the problem of stabilization of unstable plants across 
a noisy feedback link is considered. There, delay and feedback considerations become intertwined and the 
notion of feedback anytime capacity is introduced. To stabilize an otherwise unstable plant over a noisy 
channel, not only is it necessary to have a channel capable of supporting a certain minimal rate, but the 
channel when used with noiseless feedback must also support a high enough error-exponent (called the 
anytime reliability) with fixed delay in a delay-universal fashion. This turns out to be a sufficient condition 
as well, thereby establishing a separation theorem for stabilization. In [7], upper bounds are given for 
the fixed-delay reliability functions of DMCs with and without feedback, and these bounds are shown to 
be tight for certain classes of channels. Moreover, the fixed-delay reliability functions with feedback are 
shown to be fundamentally better than the traditional fixed-block-length reliability functions. 

While the stabilization problem does provide certain important insights into interactive applications, 
the separation theorem for stabilization given in [5], [6] is coarse — it only addresses performance as a 
binary valued entity: stabilized or not stabilized. All that matters is the tail-behavior of the closed-loop 
process. To get a more refined view in terms of steady-state performance, this paper instead considers 
the corresponding open-loop estimation problem. This is the seemingly classical question of lossy source 
coding for unstable scalar Markov processes: mapping the source into bits and then seeing what is
required to communicate such bits using a point-to-point communication system. 

A. Communication of Markov processes 

Coding theorems for stable Markov and auto-regressive processes under mean-squared-error distortion 
are now well established in the literature [8], [9]. We consider real-valued Markov processes, modeled as 

$$X_{t+1} = \lambda X_t + W_t \qquad (1)$$

where $\{W_t\}_{t \ge 0}$ are white and $X_0$ is an independent initial condition uniformly distributed on $[-\frac{\Omega}{2}, +\frac{\Omega}{2}]$
where $\Omega > 0$ is small. The essence of the problem is depicted in Fig. 1: to minimize the rate of the
encoding while maintaining an adequate fidelity of reconstruction. Once the source has been compressed, 
the resulting bitstreams can presumably be reliably communicated across a wide variety of noisy channels. 
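
To get a feel for the recursion $X_{t+1} = \lambda X_t + W_t$, the following minimal sketch simulates it for a stable and an unstable gain. The noise law (uniform on $[-1/2, 1/2]$), the value of `omega`, and the function name `simulate_markov` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def simulate_markov(lam, n_steps, n_paths, omega=0.01, seed=0):
    """Simulate X_{t+1} = lam * X_t + W_t for several independent paths.

    Assumptions (illustrative only): W_t is uniform on [-1/2, 1/2], which is
    white with bounded support, and X_0 is uniform on [-omega/2, omega/2].
    """
    rng = np.random.default_rng(seed)
    x = np.empty((n_paths, n_steps + 1))
    x[:, 0] = rng.uniform(-omega / 2, omega / 2, size=n_paths)
    w = rng.uniform(-0.5, 0.5, size=(n_paths, n_steps))
    for t in range(n_steps):
        x[:, t + 1] = lam * x[:, t] + w[:, t]
    return x

stable = simulate_markov(lam=0.5, n_steps=30, n_paths=2000)
unstable = simulate_markov(lam=2.0, n_steps=30, n_paths=2000)

# Empirical variance across paths at two times.
print("stable   variance at t=10, 30:", stable[:, 10].var(), stable[:, 30].var())
print("unstable variance at t=10, 30:", unstable[:, 10].var(), unstable[:, 30].var())
```

With the stable gain the variance settles near a constant, while with the unstable gain it grows roughly like $\lambda^{2t}$, which is why no stationary distribution is available to a block code.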

The infinite-horizon source-coding problem is to design a source code minimizing the rate $R$ used
to encode the process while keeping the reconstruction close to the original source in an average sense
$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E[|X_t - \hat{X}_t|^{\eta}]$. The key issue is that any given encoder/decoder system must have a
bounded delay when used over a fixed-rate noiseless channel. The encoder is not permitted to look into
the entire infinite future before committing to an encoding for $X_t$. To allow the laws of large numbers to
work, a finite, but potentially large, end-to-end delay is allowed between when the encoder observes $X_t$
and when the decoder emits $\hat{X}_t$. However, this delay must remain bounded and not grow with $t$.

Fig. 1. The point-to-point communication problem considered here. The goal is to minimize end-to-end average distortion $\rho(X_t, \hat{X}_t)$. Finite,
but possibly large, end-to-end delay will be permitted. One of the key issues explored is what must be made available at the source/channel
interface.

For the stable cases $|\lambda| < 1$, standard block-coding arguments work since long blocks separated by an
intervening block look relatively independent of each other and are in their stationary distributions. The
ability to encode blocks in an independent way also tells us that Shannon's classical sense of $\epsilon$-reliability
suffices for communicating the encoded bits across a noisy channel. The study of unstable cases
$|\lambda| > 1$ is substantially more difficult since they are neither ergodic nor stationary and furthermore their
variance grows unboundedly with time. As a result, Gray was able to prove only finite horizon results 
for such nonstationary processes and the general infinite-horizon unstable case has remained essentially 
open since Gray's 1970 paper [9]. As he put it: 

It should be emphasized that when the source is non-stationary, the above theorem is not as powerful as one 
would like. Specifically, it does not show that one can code a long sequence by breaking it up into smaller blocks 
of length $n$ and use the same code to encode each block. The theorem is strictly a "one-shot" theorem unless the
source is stationary, simply because the blocks $[(k-1)n, kn]$ do not have the same distribution for unequal $k$ when
the source is not stationary. 

On the computational side, Hashimoto and Arimoto gave a parametric form for computing the R(d) 
function for unstable auto-regressive Gaussian processes [10] and mean-square distortion. Toby Berger 
gave an explicit coding theorem for an important sub-case, the marginally unstable Wiener process with
$\lambda = 1$, by introducing an ingenious parallel stream methodology. He noticed that although the Wiener
process is nonstationary, it does have stationary and independent increments [11]. However, Berger's 
source-coding theorem said nothing about what is required from a noisy channel. In his own words [12]:

It is worth stressing that we have proved only a source coding theorem for the Wiener process, not an
information transmission theorem. If uncorrected channel errors were to occur, even in extremely rare instances, the
user would eventually lose track of the Wiener process completely. It appears (although it has never been proved)
that, even if a noisy feedback link were provided, it still would not be possible to achieve a finite [mean squared
error] per letter as $t \to \infty$.

In an earlier conference work [13] and the first author's dissertation [14], we gave a variable rate coding 
theorem that showed the R(d) bound is achievable in the infinite-horizon case if variable-rate codes are 
allowed. The question of whether or not fixed-rate and finite-delay codes could be made to work was left 
open, and is resolved here along with a full information transmission theorem. 

B. Asymptotic equivalences and direct reductions 

Beyond the technical issue of fixed or variable rate lies a deeper question regarding the nature of 
"information" in such processes. [15] contains an analysis of the traditional Kalman-Bucy filter in which 
certain entropic expressions are identified with the accumulation and dissipation of information within a 
filter. No explicit source or channel coding is involved, but the idea of different kinds of information flows 
is raised through the interpretation of certain mutual information quantities. In the stabilization problem 
of [5], it is hard to see if any qualitatively distinct kinds of information are present since to an external 
observer, the closed-loop process is stable. 

Similarly, the variable-rate code given earlier in [13], [14] does not distinguish between kinds of 
information since the same high QoS requirements were imposed on all bits. However, it was clear that 
all the bits do not require the same treatment since there are examples in which access to an additional 
lower reliability medium can be used to improve end-to-end performance [16], [14]. The true nature of 
the information within the unstable process was left open and while exponentially unstable processes 
certainly appeared to be accumulating information, there was no way to make this interpretation precise 
and quantify the amount of accumulation. 

In order to understand the nature of information, this paper builds upon the "asymptotic communication 
problem equivalence" perspective introduced at the end of [5]. This approach associates communication 
problems (e.g. communicating bits reliably at rate R or communicating iid Gaussian random variables to 
average distortion < d) with the set of channels that are good enough to solve that problem (e.g. noisy 
channels with capacity C > R). This parallels the "asymptotic computational problem equivalence" 
perspective in computational complexity theory [17] except that the critical resource shifts from
computational operations to noisy channel uses. The heart of the approach is the use of "reductions" that show
that a system made to solve one communication problem can be used as a black box to solve another 
communication problem. Two problems are asymptotically equivalent if they can be reduced to each other. 

The equivalence perspective is closely related to the traditional source/channel separation theorems. The 
main difference is that traditional separation theorems give a privileged position to one communication 
problem — reliable bit-transport in the Shannon sense — and use reductions in only one direction: from 
the source to bits. The "converse" direction is usually proved using properties of mutual information. In 
[18], [19], we give a direct proof of the "converse" for classical problems by showing the existence of 
randomized codes that embed iid message bits into iid seeming source symbols at rate R. The embedding 
is done so that the bits can be recovered with high probability from distorted reconstructions as long as 
the average distortion on long blocks stays below the distortion-rate function D(R). Similar results are 
obtained for the conditional distortion-rate function. This equivalence approach to separation theorems 
considers the privileged position of reliable bit-transport to be purely a pedagogical matter. 

This paper uses the results from [18], [19] to extend the results of [5] from the control context to the 
estimation context. We demonstrate that the problem of communicating an unstable Markov process to 
within average distortion d is asymptotically equivalent to a pair of communication problems: classical 
reliable bit-transport at a rate $\approx R(d) - \log_2 |\lambda|$ and anytime-reliable bit-transport at a rate $\approx \log_2 |\lambda|$.
This gives a precise interpretation to the nature of information flows in such processes.

C. Outline

Section II states the main results of this paper. A brief numerical example for the Gaussian case is
given to illustrate the behavior of such unstable processes. The proofs follow in the subsequent sections.

Section III considers lossy source coding for unstable Markov processes with the driving disturbance $W_t$
constrained to have bounded support. A fixed-rate code at a rate arbitrarily close to $R(d)$ is constructed
by encoding the process into two simultaneous fixed-rate message streams. The first stream has a bit-rate
arbitrarily close to $\log_2 |\lambda|$ and encodes what is needed from the past to understand the future. It captures
the information that is accumulating within the unstable process. The other stream encodes those aspects
of the past that are not relevant to the future and so captures the purely historical aspects of the unstable
process in a way that meets the average distortion constraint. This second stream can be made to have a
rate arbitrarily close to $R(d) - \log_2 |\lambda|$.

Section IV then examines this historical information more carefully by looking at the process formally
going backward in time. The $R(d)$ curve for the unstable process is shown to have a shape that is the
stable historical part translated by $\log_2 |\lambda|$ to account for the unstable accumulation of information.

Section V first reviews the fact that random codes exist achieving anytime reliability over noisy channels
even without any feedback. Then, for $\eta$-difference distortion measures, an anytime reliability $\alpha > \eta \log_2 |\lambda|$
is shown to be sufficient to encode the first bitstream of the code of Section III across a noisy channel. The
second bitstream is shown to only require classical Shannon $\epsilon$-reliability. This completes the reduction of
the lossy-estimation problem to a two-tiered reliable bit-transportation problem and resolves the conjecture
posed by Berger regarding an information transmission theorem for unstable processes.

Section VI tackles the other direction. The problem of anytime-reliable bit-transport is directly reduced
to the problem of lossy-estimation for a decimated version of the unstable process. This is done using
the ideas in [5], reinterpreted as information-embedding, and shows that the higher QoS requirements for
the first stream are unavoidable for these processes. A second stream of messages is then embedded into
the historical segments of the unstable process and this stream is recovered in the classical Shannon
$\epsilon$-reliable sense. Exponentially unstable Markov processes are thus the first nontrivial examples of stochastic
processes that naturally generate two qualitatively distinct kinds of information.

In Section VII the results are then extended to cover the Gauss-Markov case with the usual
squared-error distortion. Although the proofs are given in terms of Gaussian processes and squared error, the
results actually generalize to any $\eta$-distortion as well as driving noise distributions $W$ that have at least
an exponentially decaying tail.

This paper focuses throughout on scalar Markov processes for clarity. It is possible to extend all the
arguments to cover the general autoregressive moving average (ARMA) case. The techniques used to
cover the ARMA case are discussed in the control context in [6] where a state-space formulation is used.
A brief discussion of how to apply those techniques is present here in Section VIII.



II. Main results 
A. Performance bound in the limit of large delays 

To define R(d) for unstable Markov processes, the infinite-horizon problem is viewed as the limit of a 
sequence of finite-horizon problems: 

Definition 2.1: Given the scalar Markov source given by (1), the finite $n$-horizon version of the source
is defined to be the random variables $X_0^{n-1} = (X_0, X_1, \ldots, X_{n-1})$.

Definition 2.2: The $\eta$-distortion measure is $\rho(X_i, \hat{X}_i) = |X_i - \hat{X}_i|^{\eta}$. It is an additive distortion measure
when applied to blocks.

The standard information-theoretic rate-distortion function for the finite-horizon problem using
$\eta$-difference distortion is:

$$R_n^X(d) = \inf_{\{P(Y_0^{n-1} | X_0^{n-1}) \,:\, \frac{1}{n} \sum_{i=0}^{n-1} E[|X_i - Y_i|^{\eta}] \le d\}} \frac{1}{n} I(X_0^{n-1}; Y_0^{n-1}) \qquad (2)$$

We can consider the block $X_0^{n-1}$ as a single vector-valued random variable $\vec{X}$. The $R_n^X(d)$ defined by
(2) is related to $R_1^{\vec{X}}(d)$ by $R_n^X(d) = \frac{1}{n} R_1^{\vec{X}}(nd)$ with the distortion measure on $\vec{X}$ given by
$\rho(\vec{X}, \hat{\vec{X}}) = \sum_{i=0}^{n-1} |X_i - \hat{X}_i|^{\eta}$.

The infinite-horizon case is then defined as a limit:

$$R_\infty^X(d) = \liminf_{n \to \infty} R_n^X(d) \qquad (3)$$

The distortion-rate function $D_\infty^X(R)$ is also defined in the same manner, except that the mutual
information is fixed and the distortion is what is infimized.



B. The stable counterpart to the unstable process 

It is insightful to consider what the stable counterpart to this unstable process would be. There is a
natural choice: formally turn the recurrence relationship around and flip the order of time. This gives
the "backwards in time process" governed by the recursion

$$\overleftarrow{X}_t = \lambda^{-1} \overleftarrow{X}_{t+1} - \lambda^{-1} W_t. \qquad (4)$$

This is purely a formal reversal. In place of an initial condition $X_0$, it is natural to consider a $\overleftarrow{X}_n$ for
some time $n$ and then consider time going backwards from there. Since $|\lambda^{-1}| < 1$, this is a stable Markov
process and falls under the classical theorems of [9].
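
The reversal can be checked numerically. The sketch below first confirms that stepping backwards from the endpoint with the same noise recovers a forward path, and then treats the backward recursion as a process in its own right (fresh noise, zero initialization) to show that its variance stays bounded. The uniform noise law, the horizon `n`, and all variable names are assumptions made for illustration.

```python
import numpy as np

lam, n = 2.0, 20
rng = np.random.default_rng(1)
w = rng.uniform(-0.5, 0.5, size=n)  # assumed bounded white noise

# Forward (unstable) recursion X_{t+1} = lam * X_t + W_t, started at X_0 = 0.
x = np.zeros(n + 1)
for t in range(n):
    x[t + 1] = lam * x[t] + w[t]

# Formal reversal: start from the endpoint X_n and step backwards with the same noise.
xb = np.zeros(n + 1)
xb[n] = x[n]
for t in range(n - 1, -1, -1):
    xb[t] = xb[t + 1] / lam - w[t] / lam
recovered = np.allclose(x, xb)

# As a stand-alone process: fresh noise, initialized at zero, gain 1/lam.
paths = 5000
z = np.zeros(paths)
for _ in range(200):
    z = z / lam - rng.uniform(-0.5, 0.5, size=paths) / lam
backward_var = z.var()

print("path recovered:", recovered)
print("backward-process variance:", backward_var)
```

For this noise law the backward variance should settle near $\sigma^2 / (\lambda^2 - 1)$ with $\sigma^2 = 1/12$, in contrast to the exponential growth of the forward process.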



C. Encoders and decoders 

For notational convenience, time is synchronized between the source and the channel. Thus both delay 
and rate can be measured against either source symbols or channel uses. 

Definition 2.3: A discrete time channel is a probabilistic system with an input. At every time step $t$, it
takes an input $a_t \in \mathcal{A}$ and produces an output $c_t \in \mathcal{C}$ with probability $p(C_t | a_1^t, c_1^{t-1})$ where the notation $a_1^t$
is shorthand for the sequence $(a_1, a_2, \ldots, a_t)$. In general, the current channel output is allowed to depend
on all inputs so far as well as on past outputs.

The channel is memoryless if conditioned on $a_t$, the random variable $C_t$ is independent of any other
random variable in the system that occurs at time $t$ or earlier. So all that needs to be specified is $p_t(C_t | a_t)$.
The channel is memoryless and stationary if $p_t(C_t | a_t) = p(C_t | a_t)$ for all times $t$.

Definition 2.4: A rate $R$ source-encoder $\mathcal{E}_s$ is a sequence of maps $\{\mathcal{E}_{s,i}\}$. The range of each map is
a single bit $b_i \in \{0, 1\}$ if it is a pure source encoder and is from the channel input alphabet $\mathcal{A}$ if it is a
joint source-channel encoder. The $i$-th map takes as input the available source symbols $X_1^{\lfloor i/R \rfloor}$.

Similarly, a rate $R$ channel-encoder $\mathcal{E}_c$ without feedback is a sequence of maps $\{\mathcal{E}_{c,t}\}$. The range of
each map is the channel input alphabet $\mathcal{A}$. The $t$-th map takes as input the available message bits $B_1^{\lfloor Rt \rfloor}$.

Randomized encoders also have access to random variables denoting the common randomness available 
in the system. This common randomness is independent of the source and channel. 

Definition 2.5: A delay $\phi$ rate $R$ source-decoder is a sequence of maps $\{\mathcal{D}_{s,t}\}$. The range of each map
is just an estimate $\hat{X}_t$ for the $t$-th source symbol. For pure source decoders, the $t$-th map takes as input the
available message bits $B_1^{\lfloor (t+\phi)R \rfloor}$. For joint source-channel decoders, it takes as input the available channel
outputs $C_1^{t+\phi}$. Either way, it can see $\phi$ time units beyond the time when the desired source symbol first
had a chance to impact its inputs.

Similarly, a delay $\phi$ rate $R$ channel-decoder is a sequence of maps $\{\mathcal{D}_{c,i}\}$. The range of each map is
just an estimate $\hat{B}_i$ for the $i$-th bit taken from $\{0, 1\}$. The $i$-th map takes as input the available channel
outputs $C_1^{\lceil i/R \rceil + \phi}$ which means that it can see $\phi$ time units beyond the time when the desired message bit
first had a chance to impact the channel inputs.

Randomized decoders also have access to the random variables denoting common randomness.

Fig. 2. The timeline in a rate $\frac{1}{2}$ delay 7 channel code. Both the encoder and decoder must be causal so $A_t$ and $\hat{B}_i$ are functions only of
quantities to the left of them on the timeline. If noiseless feedback is available, the $A_t$ can also have an explicit functional dependence on
the $C_1^{t-1}$ that lie to the left on the timeline.

Fig. 3. The source-coding problem of translating the source into two simultaneous bitstreams of fixed rates $R_1$ and $R_2$. End-to-end delay
is permitted but must remain bounded for all time. The goal is to get $R_1 \approx \log_2 |\lambda|$ and $R_2 \approx R(d) - \log_2 |\lambda|$.

The timeline is illustrated in Fig. 2 for channel coding and a similar timeline holds for either pure
source coding or joint source-channel coding.

For a specific channel, the maximum rate achievable for a given sense of reliable communication is
called the associated capacity. Shannon's classical $\epsilon$-reliability requires that for a suitably large end-to-end
delay $\phi$, the probability of error on each bit is below a specified $\epsilon$.

Definition 2.6: A rate $R$ anytime communication system over a noisy channel is a single channel encoder
$\mathcal{E}_c$ and decoder $\mathcal{D}_c^{\phi}$ family for all end-to-end delays $\phi$.

A rate $R$ communication system achieves anytime reliability $\alpha$ if there exists a constant $K$ such that:

$$P(\hat{B}_1^i(t) \ne B_1^i) \le K 2^{-\alpha(t - \frac{i}{R})} \qquad (5)$$

holds for every $i$. The probability is taken over the channel noise, the message bits $B$, and all of the
common randomness available in the system. If (5) holds for every possible realization of the message
bits $B$, then we say that the system achieves uniform anytime reliability $\alpha$.

Communication systems that achieve anytime reliability are called anytime codes and similarly for 
uniform anytime codes. 

The important thing to understand about anytime reliability is that it is not considered to be a proxy used 
to study encoder/decoder complexity as traditional reliability functions often are [8]. Instead, the anytime 
reliability parameter $\alpha$ indexes a sense of reliable transmission for a bitstream in which the probability
of bit error tends to zero exponentially as time goes on. 

D. Main results 

The first result concerns the source coding problem illustrated in Fig. 3 for unstable Markov processes
with bounded-support driving noise. 

Theorem 2.1: Assume both the source encoder and source decoder can be randomized. Given the
unstable ($|\lambda| > 1$) scalar Markov process from (1) driven by independent noise $\{W_t\}_{t \ge 0}$ with bounded
support, it is possible to encode the process to average fidelity $E[|X_i - \hat{X}_i|^{\eta}]$ arbitrarily close to $d$ using
two fixed-rate bitstreams and a suitably high end-to-end delay $\phi$.

The first stream (called the checkpoint stream) can be made to have rate $R_1$ arbitrarily close to $\log_2 |\lambda|$
while the second (called the historical stream) can have rate $R_2$ arbitrarily close to $R_\infty^X(d) - \log_2 |\lambda|$.
Proof: See Section III.
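
A toy calculation suggests why a rate near $\log_2 |\lambda|$ is the natural threshold for the checkpoint stream. It is a simplified per-step interval-tracking argument under assumed bounded noise, not the block construction used in the proof. Suppose the decoder knows the state to within an interval of some width. One step of the recursion stretches that interval by $|\lambda|$ and widens it by the noise support, and $r$ bits per step shrink it by $2^r$. The function name and parameters below are hypothetical.

```python
def uncertainty_widths(lam, bits_per_step, noise_width=1.0, steps=50, width0=0.01):
    """Track the width of the decoder's uncertainty interval over time.

    Toy model (an assumption, not the paper's code): each step the interval
    grows to |lam| * width + noise_width, and the encoder then spends
    bits_per_step bits to pick one of 2**bits_per_step equal sub-intervals.
    """
    widths = [width0]
    for _ in range(steps):
        grown = abs(lam) * widths[-1] + noise_width
        widths.append(grown / 2 ** bits_per_step)
    return widths

# lam = 2, so log2|lam| = 1 bit per step is the critical rate.
above = uncertainty_widths(2.0, bits_per_step=2.0)   # rate above the threshold
below = uncertainty_widths(2.0, bits_per_step=0.5)   # rate below the threshold
print("final width, 2 bits/step:  ", above[-1])
print("final width, 0.5 bits/step:", below[-1])
```

In this model the width converges to `noise_width / (2**r - |lam|)` when $2^r > |\lambda|$ and diverges exponentially when $2^r < |\lambda|$.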

In a very real sense, the first stream in Theorem 2.1 represents an initial description of the process to
some fidelity, while the second represents a refinement of the description [20]. These two descriptions 
turn out to be qualitatively different when it comes to communicating them across a noisy channel. 

Theorem 2.2: Suppose that a communication system provides uniform anytime reliability $\alpha > \eta \log_2 |\lambda|$
for the checkpoint message stream at bit-rate $R_1$. Then given sufficient end-to-end delay $\phi$, it is possible
to reconstruct the checkpoints to arbitrarily high fidelity in the $\eta$-distortion sense.
Proof: See Section V-B.
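
A back-of-the-envelope heuristic (not the paper's proof) indicates where the threshold $\eta \log_2 |\lambda|$ comes from. Assume that a checkpoint bit still in error after a delay of $\delta$ time steps causes a reconstruction error that has been amplified by roughly $|\lambda|^{\delta}$, and that such an error occurs with probability at most $K 2^{-\alpha \delta}$. The constant $c$ and the per-delay union bound are assumptions of this sketch.

```latex
% Heuristic only: c is an unspecified constant and \delta is the decoding delay.
\mathbb{E}\big[|X_t - \hat{X}_t|^{\eta}\big]
  \;\lesssim\; \sum_{\delta \ge 0}
     \underbrace{K\, 2^{-\alpha \delta}}_{\text{probability of an error at delay } \delta}
     \cdot
     \underbrace{c\, |\lambda|^{\eta \delta}}_{\text{resulting } \eta\text{-distortion}}
  \;=\; cK \sum_{\delta \ge 0} 2^{-(\alpha - \eta \log_2 |\lambda|)\,\delta}
```

The geometric series is finite exactly when $\alpha > \eta \log_2 |\lambda|$, which matches the condition in the theorem.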

Theorem 2.3: Suppose that a communication system can reliably communicate message bits meeting
any bit-error probability $\epsilon$ given a long enough delay. Then, that communication system can be used to
reliably communicate the historical information message stream generated by the fixed-rate source code
of Theorem 2.1 in that the expected end-to-end distortion can be made arbitrarily close to the distortion
achieved by the code over a noiseless channel.
Proof: See Section V-C.

The Gauss-Markov case with mean squared error is covered by corollaries: 

Corollary 2.1: Assume both the encoder and decoder are randomized and the finite end-to-end delay $\phi$
can be chosen to be arbitrarily large. Given an unstable ($|\lambda| > 1$) scalar Markov process (1) driven by iid
Gaussian noise $\{W_t\}_{t \ge 0}$ with zero mean and variance $\sigma^2$, it is possible to encode the process to average
fidelity $E[|X_i - \hat{X}_i|^2]$ arbitrarily close to $d$ using two fixed-rate bitstreams.

The checkpoint stream can be made to have rate $R_1$ arbitrarily close to $\log_2 |\lambda|$ while the historical
stream can have rate $R_2$ arbitrarily close to $R_\infty^X(d) - \log_2 |\lambda|$.
Proof: See Section VII-A.

Corollary 2.2: Suppose that a communication system provides us with the ability to carry two message
streams: one at rate $R_1 > \log_2 |\lambda|$ with uniform anytime reliability $\alpha > 2 \log_2 |\lambda|$, and another with
classical Shannon reliability at rate $R_2 > R_\infty^X(d) - \log_2 |\lambda|$, where $R_\infty^X(d)$ is the rate-distortion function
for an unstable Gauss-Markov process with unstable gain $|\lambda| > 1$ and squared-error distortion.

Then it is possible to successfully transport the two-stream code of Corollary 2.1 using this
communication system by picking a sufficiently large end-to-end delay $\phi$. The mean squared error of the resulting
system will be as close to $d$ as desired.
Proof: See Section VII-B.

Theorems 2.2 and 2.3 together with the source code of Theorem 2.1 combine to establish a reduction
of the $d$-lossy joint source/channel coding problem to the problem of communicating message bits at rate
$R(d)$ over the same channel, wherein a substream of message bits at rate $\approx \log_2 |\lambda|$ is given an anytime
reliability of at least $\eta \log_2 |\lambda|$. This reduction is in the sense of Section VII of [5]: any channel that is
good enough to solve the second pair of problems is good enough to solve the first problem.

The asymptotic relationship between the forward and backward rate-distortion functions is captured in 
the following theorem. 

Theorem 2.4: Let $X$ be the unstable Markov process of (1) with $|\lambda| > 1$ and let the stable
backwards-in-time version from (4) be denoted $\overleftarrow{X}$. Assume that the iid driving noise $W$ has a Riemann-integrable
density $f_W$ and there exists a constant $K$ so that $E[|\sum_{i=1}^{t} \lambda^{-i} W_i|^{\eta}] \le K$ for all $t \ge 1$. Furthermore,
for the purpose of calculating the rate-distortion functions below, assume that the backwards-in-time
version is initialized with $\overleftarrow{X}_n = 0$. Let $Q_\Delta$ be the uniform quantizer that maps its input to the nearest
neighbor of the form $k\Delta$ for integer $k$.



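$$R_\infty^{\overleftarrow{X}}(d) =_{(a)} \lim_{\Delta \to 0} \lim_{n \to \infty} R_n^{[\ldots]}(d) =_{(b)} \lim_{n \to \infty} R_n^{[\ldots]}(d) =_{(c)} R_\infty^X(d) - \log_2 |\lambda|. \qquad (6)$$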

or expressed in terms of distortion-rate functions for $R > \log_2 |\lambda|$:

$$D_\infty^X(R) = D_\infty^{\overleftarrow{X}}(R - \log_2 |\lambda|).$$

This implies that the process generally undergoes a phase transition from infinite to bounded average
distortion at the critical rate $\log_2 |\lambda|$.
Proof: See Section IV-B.

Notice that there are no explicitly infinite distortions in the original setup of the problem. Consequently, 
the appearance of infinite distortions is interesting as is the abrupt transition from infinite to finite 
distortions around the critical rate of $\log_2 |\lambda|$. This abrupt transition gives a further indication that there
is something fundamentally nonclassical about the rate $\log_2 |\lambda|$ information inside the process.

To make this precise, a converse is needed. Classical rate-distortion results only point out that the 
mutual information across the communication system must be at least R(d) on average. However, as [21] 
points out, having enough mutual information is not enough to guarantee a reliable-transport capacity 
since the situation here is not stationary and ergodic. The following theorem gives the converse, but adds 
an intuitively required additional condition that the probability of excess average distortion over any long 
enough segment can be made as small as desired. 

Theorem 2.5: Consider the unstable process given by (1) with the iid driving noise $W$ having a
Riemann-integrable density $f_W$ satisfying the conditions of Theorem 2.4.

Suppose there exists a family (indexed by window size $n$) of joint source-channel codes $(\mathcal{E}_s, \mathcal{D}_s)$ so
that the $n$-th member of the family has reconstructions that satisfy

$$E[|X_{kn} - \hat{X}_{kn}|^{\eta}] \le d \qquad (7)$$

for every positive integer $k$. Furthermore, assume the family collectively also satisfies

$$\lim_{n \to \infty} \sup_{\tau \ge 0} P\left(\frac{1}{n} \sum_{i=\tau}^{\tau+n-1} |X_i - \hat{X}_i|^{\eta} > d\right) = 0 \qquad (8)$$

so that the probability of excess distortion can be made arbitrarily small on long enough blocks.

Then for any $R_1 < \log_2 |\lambda|$, $\alpha < \eta \log_2 |\lambda|$, $R_2 < R_\infty^X(d) - \log_2 |\lambda|$, $P_e > 0$, the channel must support
the simultaneous carrying of a bit-rate $R_1$ priority message stream with anytime reliability $\alpha$ along with
a second message stream of bit-rate $R_2$ with a probability of bit error $\le P_e$ for some end-to-end delay $\phi$.
Proof: See Section VI-A.

Note that a Gaussian disturbance $W$ is covered by Theorems 2.4 and 2.5 even if the difference distortion
measure is not mean squared-error. 

E. An example and comparison to the sequential rate distortion problem 

In the case of Gauss-Markov processes with squared-error distortion, Hashimoto and Arimoto in [10]
give an explicit way of calculating $R(d)$. Tatikonda in [22], [23] gives a similar explicit lower bound
to the rate required when the reconstruction $\hat{X}_t$ is forced to be causal in that it can only depend on $X_j$
observations for $j \le t$.

Assuming unit variance for the driving noise $W$ and $\lambda > 1$, Hashimoto's formula is parametric in terms
of the water-filling parameter $\kappa$ and for the Gauss-Markov case considered here simplifies to:



$$D(\kappa) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \min\left[\kappa, \frac{1}{1 - 2\lambda\cos(\omega) + \lambda^2}\right] d\omega,$$

$$R(\kappa) = \log_2 \lambda + \frac{1}{2\pi} \int_{-\pi}^{\pi} \max\left[0, \frac{1}{2} \log_2 \frac{1}{\kappa(1 - 2\lambda\cos(\omega) + \lambda^2)}\right] d\omega. \qquad (9)$$

Fig. 4. The distortion-rate curves for an unstable Gauss-Markov process with $\lambda = 2$ and its stable backwards-version. The stable and
unstable $D(R)$ curves are related by a simple translation by 1 bit per symbol.

The rate-distortion function for the stable counterpart given in (4) has a water-filling solution that is
identical to (9), except without the $\log_2 \lambda$ term in $R(\kappa)$. Thus, in the Gaussian case with squared error
distortion, direct calculation verifies the claim

$$R_\infty^X(d) = \log_2 \lambda + R_\infty^{\overleftarrow{X}}(d)$$

from Theorem 2.4.

For the unstable process, Tatikonda's formula for causal reconstructions is given by 

$$R_{seq}(d) = \frac{1}{2}\log_2\left(\lambda^2 + \frac{1}{d}\right). \tag{10}$$

Fig. 4 shows the distortion-rate frontier for both the original unstable process and the backwards stable 
process. It is easy to see that the forward and backward process curves are translations of each other. 
In addition, the sequential rate-distortion curve for the forward process is qualitatively distinct: $D_{seq}(R)$ 
goes to infinity as $R \downarrow \log_2\lambda$ while $D(R)$ approaches a finite limit. 
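
As a rough numerical illustration of (9) and (10), the following Python sketch evaluates the parametric curve by averaging the integrands over a uniform grid in $\omega$. The function names, the grid size, and the sample values of $\kappa$ are illustrative assumptions and are not taken from the original development; unit-variance driving noise is assumed as in the text.

```python
import math


def spectral_term(omega, lam):
    """The term 1 / (1 - 2*lam*cos(omega) + lam**2) shared by both integrals in (9)."""
    return 1.0 / (1.0 - 2.0 * lam * math.cos(omega) + lam * lam)


def hashimoto_point(kappa, lam, grid=4096):
    """Return (D, R_unstable, R_stable) for water-filling level kappa.

    (1/2pi) * integral over [-pi, pi] is approximated by a plain average
    over a uniform grid, which is adequate for illustration.
    """
    d_sum = 0.0
    r_sum = 0.0
    for j in range(grid):
        omega = -math.pi + 2.0 * math.pi * j / grid
        g = spectral_term(omega, lam)
        d_sum += min(kappa, g)
        r_sum += max(0.0, 0.5 * math.log2(g / kappa))
    r_stable = r_sum / grid
    return d_sum / grid, math.log2(lam) + r_stable, r_stable


def d_seq(rate, lam):
    """Invert (10): distortion of the causal bound at a given rate."""
    if rate <= math.log2(lam):
        return math.inf
    return 1.0 / (2.0 ** (2.0 * rate) - lam * lam)


if __name__ == "__main__":
    lam = 2.0
    for kappa in (0.01, 0.05, 0.2, 0.5, 1.0):
        d, r, r_stable = hashimoto_point(kappa, lam)
        print(f"kappa={kappa:.2f}  D={d:.4f}  R={r:.4f}  "
              f"R_stable={r_stable:.4f}  D_seq(R)={d_seq(r, lam):.4f}")
```

In this sketch the unstable and stable rates differ by exactly $\log_2\lambda$ at every $\kappa$, which is the translation described above, while `d_seq` diverges as the rate approaches $\log_2\lambda$.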

The results in this paper show that the lower curve for the regular distortion-rate frontier can be 
approached arbitrarily closely by increasing the acceptable (finite) end-to-end delay. This suggests that it 
takes some time for the randomness entering the unstable process through W to sort itself into the two 
categories of fundamental accumulation and transient history. The difference in the resulting distortion is 
not that significant at high rates, but becomes unboundedly large as the rate approaches $\log_2\lambda$. It is open 
whether information-embedding theorems similar to Theorem 2.5 exist that give an operational 
meaning to the gap between $R_{seq}(d)$ and $R(d)$. If a communication system can be used to satisfy distortion 
$d$ in a causal way, does that mean the underlying communication resources also must be able to support 
messages at this higher rate $R_{seq}(d)$? 

III. Two stream source encoding: approaching R(d) 

This section proves Theorem 2.1. 

Fig. 5. A flowchart showing how to do fixed-rate source coding for Markov sources using two streams and how the streams are decoded. 



A. Proof strategy 

The code for proving Theorem 2.1 is illustrated in Fig. 5. Without loss of generality, assume $\lambda = |\lambda| > 1$ 
to avoid the notational complication of keeping track of the sign. 

• Look at time in blocks of size $n$ and encode the values of endpoints $(X_{kn-1}, X_{kn})$ recursively to 
very high precision using rate $n(\log_2\lambda + \epsilon_1)$ per pair. Each block $X_{kn}, X_{kn+1}, \ldots, X_{(k+1)n-1}$ will 
have encoded checkpoints $(\hat{X}_{kn}, \hat{X}_{kn+n-1})$ at both ends. 

• Use the encoded checkpoints $\{\hat{X}_{kn}\}$ at the start of the blocks to transform the process segments in 
between (the history) so that they look like an iid sequence of finite horizon problems $\vec{X}$. 

• Use the checkpoints $\{\hat{X}_{kn+n-1}\}$ at the end of the blocks as side-information to encode the history 
to fidelity $d$ at a rate of $n(R_\infty^{X}(d) - \log_2\lambda + \epsilon_2 + o(1))$ per block. 

• "Stationarize" the encoding by choosing a random starting offset so that no times $t$ are a priori more 
vulnerable to distortion. 

The source decoding proceeds in the same manner and first recovers the checkpoints, and then uses them 
as known side-information to decode the history. The two are then recombined to give a reconstruction 
of the original source to the desired fidelity. 

The above strategy follows the spirit of Berger's encoding [11]. In Berger's code for the Wiener process, 
the first stream's rate is negligible relative to that of the second stream. In our case, the first stream's rate 
is significant and cannot be averaged away by using a large block length $n$. 

The detailed constructions and proof for this theorem are in the next few subsections, with some 
technical aspects relegated to the appendices. 

B. Recursively encoding checkpoints 

This section relies on the assumption of bounded support for the driving noise, $|W_t| \le \frac{\Omega}{2}$, but does 
not care about any other property of the $\{W_t\}_{t \ge 0}$ like independence or stationarity. The details of the 
distortion measure are also not important for this section. 

Proposition 3.1: Given the unstable ($\lambda > 1$) scalar Markov process of (1) driven by noise $\{W_t\}_{t \ge 0}$ 
with bounded support, and any $\Delta > 0$, it is possible to causally and recursively encode checkpoints 
spaced by $n$ so that $|\hat{X}_{kn} - X_{kn}| \le \frac{\Delta}{2}$. For any $R_1 > \log_2\lambda$, this can be done with rate $nR_1$ bits per 
checkpoint by choosing $n$ large enough. Furthermore, if an iid sequence of independent pairs of continuous 
uniform random variables $\{\Theta_i, \Theta'_i\}_{i \ge 0}$ is available to both the encoder and decoder for dithering, the errors 
$(\hat{X}_{kn-1} - X_{kn-1}, \hat{X}_{kn} - X_{kn})$ can be made an iid sequence of pairs of independent uniform random variables 
on $[-\frac{\Delta}{2}, +\frac{\Delta}{2}]$. 

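Proof: First, consider the initial condition at $X_0$. It can be quantized to be within an interval of size $\Delta$ 
by using $\log_2\lceil\frac{\Omega_0}{\Delta}\rceil$ bits. 

With a block length of $n$, the successive endpoints are related by: 

$$X_{(k+1)n} = \lambda^n X_{kn} + \left[\lambda^{n-1}\sum_{i=0}^{n-1}\lambda^{-i}W_{kn+i}\right] \tag{11}$$

The second term $[\cdots]$ on the right of (11) can be denoted $\tilde{W}_k$ and bounded using 

$$|\tilde{W}_k| = \left|\lambda^{n-1}\sum_{i=0}^{n-1}\lambda^{-i}W_{kn+i}\right| \le \lambda^{n-1}\sum_{i=0}^{n-1}\lambda^{-i}\frac{\Omega}{2} < \frac{\lambda^n\Omega}{2(\lambda - 1)}. \tag{12}$$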

Proceed by induction. Assume that $\hat{X}_{kn}$ satisfies $|X_{kn} - \hat{X}_{kn}| \le \frac{\Delta}{2}$. This clearly holds for $k = 0$. Without 
any further information, it is known that $X_{(k+1)n}$ must lie within an interval of size $\lambda^n\Delta + \lambda^n\frac{\Omega}{\lambda - 1}$. By 
using $nR'_1$ bits (where $R'_1$ is chosen to guarantee an integer $nR'_1$) to encode where the true value lies, the 
uncertainty is cut by a factor of $2^{nR'_1}$. To have the resulting interval of size $\Delta$ or smaller, we must have: 

$$\Delta \ge 2^{-nR'_1}\lambda^n\left(\Delta + \frac{\Omega}{\lambda - 1}\right).$$

Dividing through by $\Delta 2^{-nR'_1}\lambda^n$ and taking logarithms gives 

$$n(R'_1 - \log_2\lambda) \ge \log_2\left(1 + \frac{\Omega}{\Delta(\lambda - 1)}\right).$$

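Encoding $X_{kn-1}$ given $\hat{X}_{kn}$ requires very little additional rate since $|X_{kn-1} - \lambda^{-1}\hat{X}_{kn}| \le \frac{1}{2}(\Omega + \Delta)$ and so 
$\log_2\lceil\frac{\Omega}{\Delta} + 1\rceil < \log_2(2 + \frac{\Omega}{\Delta})$ additional bits are good enough to encode both checkpoints. Putting everything 
together in terms of the original $R_1$ gives 

$$R_1 \ge \max\left(\log_2\lambda + \frac{\log_2\left(1 + \frac{\Omega}{\Delta(\lambda - 1)}\right) + \log_2\left(2 + \frac{\Omega}{\Delta}\right)}{n},\ \frac{\log_2\lceil\frac{\Omega_0}{\Delta}\rceil}{n}\right). \tag{13}$$

It is clear from (13) that no matter how small a $\Delta$ we choose, by picking an $n$ large enough the rate 
$R_1$ can get as close to $\log_2\lambda$ as desired. In particular, picking $n = K(\log_2\frac{1}{\Delta})^2$ works with large $K$ and 
small $\Delta$. 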

To get the uniform nature of the final error $\hat{X}_{kn} - X_{kn}$, subtractive dithering can be used [24]. This 
is accomplished by adding a small iid random variable $\Theta_k$, uniform on $[-\frac{\Delta}{2}, +\frac{\Delta}{2}]$, to the $X_{kn}$, and only 
then quantizing $(X_{kn} + \Theta_k)$ to resolution $\Delta$. At the decoder, $\Theta_k$ is subtracted from the result to get $\hat{X}_{kn}$. 
Similarly for $\hat{X}_{kn-1}$. This results in the checkpoint error sequence $(\hat{X}_{kn-1} - X_{kn-1}, \hat{X}_{kn} - X_{kn})$ being 
iid uniform pairs over $[-\frac{\Delta}{2}, +\frac{\Delta}{2}]$. These pairs are also independent of all the $W_t$ and initial condition $X_0$. 
□ 
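
The induction in the proof above can be exercised numerically. The sketch below is a simplified illustration and not the paper's construction in full: it encodes only the checkpoints $X_{kn}$ (not the pairs), uses no dither, assumes $X_0 = 0$ is known exactly, and uses exact rational arithmetic so that the exponential growth of $X_t$ does not cause rounding problems. All function names and parameter values are hypothetical.

```python
from fractions import Fraction
import random


def checkpoint_bits(lam, omega, delta, n):
    """Smallest integer b with 2**b >= lam**n * (1 + omega / (delta * (lam - 1)))."""
    need = lam ** n * (1 + omega / (delta * (lam - 1)))
    bits = 0
    while 2 ** bits < need:
        bits += 1
    return bits


def simulate_checkpoints(lam=Fraction(2), omega=Fraction(1), delta=Fraction(1, 10),
                         n=8, blocks=20, seed=0):
    """Recursively encode X_{kn}; return (bits per checkpoint, worst |X - X_hat|)."""
    rng = random.Random(seed)
    bits = checkpoint_bits(lam, omega, delta, n)
    x = Fraction(0)       # true process value (assumed known start, X_0 = 0)
    x_hat = Fraction(0)   # reconstruction shared by encoder and decoder
    worst = Fraction(0)
    for _ in range(blocks):
        for _ in range(n):
            w = Fraction(rng.randint(-500, 500), 1000) * omega  # |W_t| <= omega / 2
            x = lam * x + w
        # Interval in which the next checkpoint must lie, known to both sides.
        width = lam ** n * (delta + omega / (lam - 1))
        low = lam ** n * x_hat - width / 2
        bin_width = width / 2 ** bits
        index = min((x - low) // bin_width, 2 ** bits - 1)  # the nR'_1 bits sent
        x_hat = low + (index + Fraction(1, 2)) * bin_width  # decoder picks the bin centre
        worst = max(worst, abs(x - x_hat))
    return bits, worst


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        bits, worst = simulate_checkpoints(n=n)
        print(f"n={n:3d}  bits/checkpoint={bits:3d}  rate={bits / n:.4f}  "
              f"worst error={float(worst):.4f}")
```

With these assumed values ($\lambda = 2$, $\Omega = 1$, $\Delta = 0.1$), the per-symbol rate `bits / n` decreases toward $\log_2\lambda = 1$ as $n$ grows while the checkpoint error stays within $\frac{\Delta}{2}$.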

In what follows, we always assume that $\Delta$ is chosen to be of high fidelity relative to the target distortion 
$d$ (e.g., for squared-error distortion, this means that $\Delta^2 \ll d$) as well as small relative to the initial 
condition so $\Delta \ll \Omega_0$. 

C. Transforming and encoding the history 

Having dealt with the endpoints, focus attention on the historical information between them. Here the 
bounded support assumption is not needed for the $\{W_t\}$, but the iid assumption is important. First, the 
encoded checkpoints are used to transform the historical information so that each historical segment looks 
iid. Then, it is shown that these segments can be encoded to the appropriate fidelity and rate when the 
decoder has access to the encoded checkpoints as side information. 

1) Forward transformation: The simplest transformation is to effectively restart the process at every 
checkpoint and view time going forward. This can be considered normalizing each of the historical 
segments $X_{kn}^{(k+1)n-1}$ to $(\vec{X}_{(k,i)}, 0 \le i \le n - 1)$ for $k = 0, 1, 2, \ldots$. 

$$\vec{X}_{(k,i)} = X_{kn+i} - \lambda^i\hat{X}_{kn} \tag{14}$$

For each $k$, the block $\vec{X}_k = \{\vec{X}_{(k,i)}\}_{0 \le i \le n-1}$ satisfies $\vec{X}_{(k,i+1)} = \lambda\vec{X}_{(k,i)} + W_{(k,i)}$. By dithered quantization, 
the initial condition ($i = 0$) of each block is a uniform random variable of support $\Delta$ that is 
independent of all the other random variables in the system. The initial conditions are iid across the 
different $k$. Thus, except for the initial condition, the blocks $\vec{X}_k$ are identically distributed to the finite 
horizon versions of the problem. 
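
The claim that dithered quantization leaves a uniform error of support $\Delta$, whatever the value being quantized, can be checked empirically. The following is a minimal Monte Carlo sketch of subtractive dithering; the function names, the sample inputs, and the sample count are illustrative assumptions.

```python
import random


def dithered_quantize(x, theta, delta):
    """Quantize x + theta to the nearest multiple of delta, then subtract theta."""
    return delta * round((x + theta) / delta) - theta


def dither_errors(x, delta, samples, seed=0):
    """Reconstruction errors for a fixed input x over many independent dither draws."""
    rng = random.Random(seed)
    errors = []
    for _ in range(samples):
        theta = rng.uniform(-delta / 2, delta / 2)  # shared by encoder and decoder
        errors.append(dithered_quantize(x, theta, delta) - x)
    return errors


def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


if __name__ == "__main__":
    delta = 0.1
    for x in (0.3137, 7.77, -123.456):
        mean, variance = mean_and_variance(dither_errors(x, delta, 20000))
        print(f"x={x:9.4f}  mean error={mean:+.5f}  variance={variance:.6f}  "
              f"(uniform would give {delta ** 2 / 12:.6f})")
```

For every input the empirical error statistics should sit close to those of a uniform variable on $[-\frac{\Delta}{2}, +\frac{\Delta}{2}]$, namely mean zero and variance $\frac{\Delta^2}{12}$.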

Since $\Delta < \Omega_0$, each $\vec{X}_k$ block starts with a tighter initial condition than the original $X$ process did. 
Since the initial condition is uniform, this can be viewed as a genie-aided version of the original problem 
where a genie reveals a few bits of information about the initial condition. Since the initial condition 
enters the process dynamics in a linear way and the distortion measure $\rho$ depends only on the difference, 
this implies that the new process with the smaller initial condition requires no more bits per symbol to 
achieve a distortion $d$ than did the original process. Thus: 

$$R_n^{X}(d) - \frac{1 + \log_2\frac{\Omega_0}{\Delta}}{n} \le R_n^{\vec{X}}(d) \le R_n^{X}(d)$$

for all $n$ and $d$. So in the limit of large $n$ 

$$R_\infty^{\vec{X}}(d) = R_\infty^{X}(d). \tag{15}$$

In simple terms, the normalized history behaves like the finite horizon version of the problem when $n$ 
is large. 

2) Conditional encoding: The idea is to encode the normalized history between two checkpoints 
conditioned on the ending checkpoint. The decoder has access to the exact values of these checkpoints 
through the first bitstream. 

For a given $k$, shift the encoded ending checkpoint $\hat{X}_{(k+1)n-1}$ to 

$$Z_k^q = \hat{X}_{(k+1)n-1} - \lambda^{n-1}\hat{X}_{kn}. \tag{16}$$

$Z_k^q$ is clearly available at both the encoder and the decoder since it only depends on the encoded 
checkpoints. Furthermore, it is clear that 

$$\vec{X}_{(k,n-1)} - Z_k^q = (X_{(k+1)n-1} - \lambda^{n-1}\hat{X}_{kn}) - (\hat{X}_{(k+1)n-1} - \lambda^{n-1}\hat{X}_{kn}) = X_{(k+1)n-1} - \hat{X}_{(k+1)n-1}$$

which is a uniform random variable on $[-\frac{\Delta}{2}, +\frac{\Delta}{2}]$. Thus $Z_k^q$ is just a dithered quantization to $\Delta$ precision 
of the endpoint $\vec{X}_{(k,n-1)}$. 

Define the conditional rate-distortion function $R_\infty^{\vec{X}|Z^q,\Theta}(d)$ for the limit of long historical blocks $\vec{X}_0^{n-1}$ 
conditioned on their quantized endpoint as 

$$R_\infty^{\vec{X}|Z^q,\Theta}(d) = \liminf_{n\to\infty}\ \inf_{\left\{\mathcal{P}(Y_0^{n-1}|\vec{X}_0^{n-1}, Z^q, \Theta)\,:\,\frac{1}{n}\sum_{i=0}^{n-1}E[|\vec{X}_i - Y_i|^\eta] \le d\right\}} \frac{1}{n} I(\vec{X}_0^{n-1}; Y_0^{n-1}|Z^q, \Theta). \tag{17}$$

Proposition 3.2: Given an unstable ($\lambda > 1$) scalar Markov process $\{\vec{X}_{k,t}\}$ obeying (1) and whose 
driving noise satisfies $E[|\sum_{i=1}^{t}\lambda^{-i}W_i|^\eta] \le K$ for all $t \ge 1$ for some constant $K$, together with its 
encoded endpoint $Z_k^q$ obtained by $\Theta$-dithered quantization to within a uniform random variable with small 
support $\Delta$, the limiting conditional rate-distortion function 

$$R_\infty^{\vec{X}|Z^q,\Theta}(d) = R_\infty^{X}(d) - \log_2\lambda. \tag{18}$$

Proof: See Appendix Hill 

The case of driving noise with bounded support clearly satisfies the conditions of this proposition since 
geometric sums converge. The conditional rate-distortion function in Proposition 3.2 has a corresponding 
coding theorem: 

Proposition 3.3: Given an unstable ($\lambda > 1$) scalar Markov process $\{X_t\}$ given by (1) together with 
its $n$-spaced pairs of encoded checkpoints $\{\hat{X}\}$ obtained by dithered quantization to within iid uniform 
random variables with small support $\Delta$, for every $\epsilon_4 > 0$ there exists an $M$ large enough so that a 
conditional source-code exists that maps a length $M$ superblock of the historical information $\{\vec{X}_k\}_{0 \le k < M}$ 
into a superblock $\{T_k\}_{0 \le k < M}$ satisfying 

$$\frac{1}{M}\sum_{k=0}^{M-1}\frac{1}{n}\sum_{j=1}^{n}E[\rho(\vec{X}_{(k,j)}, T_{(k,j)})] \le d + \epsilon_4. \tag{19}$$

By choosing $n$ large enough, the rate of the superblock code can be made as close as desired to 
$R_\infty^{X}(d) - \log_2\lambda$ if the decoder is also assumed to have access to the encoded checkpoints $\hat{X}_{kn}$. 

Proof: $M$ of the $\vec{X}_k$ blocks are encoded together using conditioning on the encoded checkpoints at the 
end of each block. The pairs $(\vec{X}_k, Z_k^q)$ have a joint distribution, but are iid across $k$ by the independence 
properties of the subtractive dither and the driving noise $W_{(k,i)}$. Furthermore, the $\vec{X}_{(k,i)}$ are bounded and 
as a result, the all zero reconstruction results in a bounded distortion on the $\vec{X}$ vector that depends on $n$. 
Even without the bounded support assumption, Theorem 2.4 reveals that there is a reconstruction based 
on the $Z_k^q$ alone that has bounded average distortion where the bound does not even depend on $n$. 

Since the side information $Z_k^q$ is available at both encoder and decoder, the classical conditional rate- 
distortion coding theorems of [25] tell us that there exists a block-length $M(n)$ so that codes exist satisfying 
(19). The rate can be made arbitrarily close to $R_n^{\vec{X}|Z^q,\Theta}(d)$. By letting $n$ get large, Proposition 3.2 reveals 
that this rate can be made as close as desired to $R_\infty^{X}(d) - \log_2\lambda$. □ 

D. Putting history together with checkpoints 

The next step is to show how the decoder can combine the two streams to get the desired rate/distortion 
performance. 

The rate side is immediately obvious since there is $\log_2\lambda$ from Proposition 3.1 and $R_\infty^{X}(d) - \log_2\lambda$ 
from Proposition 3.3. The sum is as close to $R_\infty^{X}(d)$ as desired. On the distortion side, the decoder runs 
(14) in reverse to get reconstructions. Suppose that $T_{(k,i)}$ are the encoded transformed source symbols 
from the code in Proposition 3.3. Then $\hat{X}_{kn+i} = T_{(k,i)} + \lambda^i\hat{X}_{kn}$ and so $X_{kn+i} - \hat{X}_{kn+i} = \vec{X}_{(k,i)} - T_{(k,i)}$. 
Since the differences are the same, so is the distortion. 
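
The bookkeeping in this step can be verified with exact arithmetic. In the sketch below, the forward transformation (14) is applied to one block, the normalized history is passed through a crude rounding rule that merely stands in for the conditional rate-distortion code of Proposition 3.3 (the rounding rule, names, and numbers are hypothetical), and the decoder runs (14) in reverse.

```python
from fractions import Fraction


def normalize_block(block, x_hat_start, lam):
    """Forward transformation (14): subtract lam**i times the encoded start checkpoint."""
    return [x - lam ** i * x_hat_start for i, x in enumerate(block)]


def reconstruct_block(t_block, x_hat_start, lam):
    """Decoder side: run (14) in reverse on the encoded history symbols."""
    return [t + lam ** i * x_hat_start for i, t in enumerate(t_block)]


def coarse_round(value, step):
    """Hypothetical stand-in for the history code: round to a multiple of step."""
    return step * round(value / step)


lam = Fraction(3, 2)
noise = [Fraction(1, 4), Fraction(-1, 3), Fraction(1, 5), Fraction(-1, 2)]
x_start = Fraction(1000)                 # true value at the start of the block
x_hat_start = x_start + Fraction(1, 50)  # encoded checkpoint, off by a small error

# Generate one block of the unstable process.
block = [x_start]
for w in noise:
    block.append(lam * block[-1] + w)

history = normalize_block(block, x_hat_start, lam)
encoded = [coarse_round(h, Fraction(1, 4)) for h in history]
decoded = reconstruct_block(encoded, x_hat_start, lam)

source_errors = [x - y for x, y in zip(block, decoded)]
history_errors = [h - t for h, t in zip(history, encoded)]

if __name__ == "__main__":
    print("errors match:", source_errors == history_errors)
```

Because the same term $\lambda^i\hat{X}_{kn}$ is subtracted by the encoder and added back by the decoder, the reconstruction error on the source equals the error on the normalized history symbol for symbol, whatever history code is used.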

E. "Stationarizing" the code 

The underlying X t process is non- stationary so there is no hope to make the encoding truly stationary. 
However, as it stands, only the average distortion across each of the Mn length superblocks is close to d 
in expectation giving the resulting code a potentially "cyclostationary" character. Nothing guarantees that 
source symbols at every time will have the same level of expected fidelity. To fix this, a standard trick 
can be applied by making the encoding have two phases: 

• An initialization phase that lasts for a random $T$ time-steps. $T$ is a random integer chosen uniformly 
from $0, 1, \ldots, Mn - 1$ based on common randomness available to the encoder and decoder. During the 
first phase, all source symbols are encoded to fidelity $\Delta$ recursively using the code of Proposition 3.1 
with $n = 1$. 

• A main phase that applies the two-part code described above but starts at time $T + 1$. 

The extra rate required in the first phase is negligible on average since it is a one-time cost. This takes 
a finite amount of time to drain out through the rate $R_1$ message stream. This time can be considered an 
additional delay that must be suffered for everything in the second phase. Thus it adds to the delay of $n$ 
required by the causal recursive code for the checkpoints. The rest of the end-to-end delay is determined 
by the total length $Mn$ of the superblock chosen inside Proposition 3.3. 

Let $d_i$ be such that the original super-block code gives expected distortion $d_i$ at position $i$ ranging from 
$0$ to $Mn - 1$. It is known from Proposition 3.3 that $\frac{1}{Mn}\sum_{i=0}^{Mn-1} d_i \le d + \epsilon_4$. Because the first phase is 
guaranteed to be high fidelity and all other time positions are randomly and uniformly assigned positions 
within the superblock of size $Mn$, the expected distortion $E[|X_i - \hat{X}_i|^\eta] \le d + \epsilon_4$ for every time position $i$. 
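
The averaging effect of the random offset can be made concrete with a small exact computation. The sketch below assumes a hypothetical per-position distortion profile $d_i$ over a superblock of length $L = Mn$ and treats times up to and including $T$ as the initialization phase; both the profile and that indexing convention are assumptions made for illustration.

```python
from fractions import Fraction


def expected_distortion(t, d_profile, d_init=Fraction(0)):
    """Expected distortion at absolute time t when the offset T is uniform on {0, ..., L-1}.

    d_profile[i] is the distortion at position i of a superblock of length L.
    Times t <= T fall in the initialization phase and incur d_init.
    """
    length = len(d_profile)
    total = Fraction(0)
    for offset in range(length):
        if t <= offset:
            total += d_init
        else:
            # The main phase starts at time offset + 1, which is superblock position 0.
            total += d_profile[(t - offset - 1) % length]
    return total / length


if __name__ == "__main__":
    profile = [Fraction(1), Fraction(5), Fraction(2), Fraction(8)]  # uneven on purpose
    average = sum(profile) / len(profile)
    for t in range(0, 12):
        print(t, expected_distortion(t, profile), "average:", average)
```

Once $t \ge L$, every superblock position is equally likely, so the expected distortion equals the superblock average regardless of how unevenly the $d_i$ are spread.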

The code actually does better than that since the probability of excess average distortion over a long 
block is also guaranteed to go to zero. This property is inherited from the repeated use of independent 
conditional rate-distortion codes in the second stream [25]. 

This completes the proof of Theorem 2.1. 

IV. Time-reversal and the essential phase transition 

It is interesting to note that the distortion of the code in the previous section turns out to be entirely 
based on the conditional rate-distortion performance for the historical segments. The checkpoints merely 
contribute a $\log_2\lambda$ term in the rate. 

The nature of historical information in the unstable Markov process described by (1) can be explored 
more fully by transforming the historical blocks going locally backward in time. The informational 
distinction between the process going forward and the purely historical information parallels the concepts 
of information production and dissipation explored in the context of the Kalman Filter [15]. 

First, the original problem is formally decomposed into forward and backward parts. Then, Theorem 2.4 
is proved. 

A. Endpoints and history 

It is useful to think of the original problem as being broken down into two analog sub-problems: 

1) The n-endpoint problem: This is the communication of the process $\{X_{kn}\}$ where each sample arrives 
every $n$ time steps and the samples are related to each other through (11) with $\tilde{W}_k$ being iid and having 
the same distribution as $\lambda^{n-1}\sum_{i=0}^{n-1}\lambda^{-i}W_i$. 

This process must be communicated so that $E[|X_{kn} - \hat{X}_{kn}|^\eta] \le K$ for some performance $K$. This is 
essentially a decimated version of the original problem. 

2) The conditional history problem: The stable $\overleftarrow{X}$ process can be viewed in blocks 
of length $n$. The conditional history problem is thus the problem of communicating an iid sequence of 
$n$-vectors $\overleftarrow{X}_k = (\overleftarrow{X}_{k,1}, \ldots, \overleftarrow{X}_{k,n-1})$ conditioned on iid $Z_k$ that are known perfectly at the encoder and 
decoder. The joint distribution of $\overleftarrow{X}$, $Z$ is given by: 

i=0 

$$\overleftarrow{X}_t = \lambda^{-1}\overleftarrow{X}_{t+1} - \lambda^{-1}W_t$$

where the underlying $\{W_t\}$ are iid. Unrolling the recursion gives $\overleftarrow{X}_t = -\sum_{i=0}^{n-1-t}\lambda^{-i-1}W_{t+i}$. The $Z$ is 
thus effectively the endpoint $Z = \overleftarrow{X}_0$. The vectors $\overleftarrow{X}_k$ are made available to the encoder every $n$ time 
units along with their corresponding side-information $Z_k$. The goal is to communicate these to a receiver 
that has access to the side-information $Z_k$ so that $\frac{1}{n}\sum_{i=1}^{n-1}E[\rho(\overleftarrow{X}_{k,i}, \widehat{\overleftarrow{X}}_{k,i})] \le d$ for all $k$. 

The relevant rate distortion function for the above problem is the conditional rate-distortion function 
$R_n^{\overleftarrow{X}|Z}(d)$. The proof of Theorem 2.1 in the previous section involves a slightly modified version of the 
above where the side-information $Z$ is known only to some quantization precision $\Delta$. The quantized 
side-information is $Z^q = Q_\Delta(Z)$. The relevant conditional rate-distortion function is $R_n^{\overleftarrow{X}|Z^q}(d)$. 

3) Reductions back to the original problem: It is obvious how to put these two problems together to 
construct an unstable {X t } stream: the endpoints problem provides the skeleton and the conditional history 
interpolates in between. To reduce the endpoints problem to the original unstable source communication 
problem, just use randomness at the transmitter to sample from the interpolating distribution and fill in 
the history. 

To reduce the conditional history problem to the original unstable source communication problem, just 
use the iid Z k to simulate the endpoints problem and use the interpolating X history to fill out {X t }. 
Because the distortion measure is a difference distortion measure, the perfectly known endpoint process 
allows us to translate everything so that the same average distortion is attained. 



B. Rate-distortion relationships proved 

Theorem 2.4 tells us that the unstable $|\lambda| > 1$ Markov processes are nonclassical only as they evolve 
into the future. The historical information is a stable Markov process that fleshes out the unstable skeleton 
of the nonstationary process. This fact also allows a simplification in the code depicted in Fig. 5. Since the 
side-information does not impact the rate-distortion curve for the stable historical process, the encoding 
of the historical information can be done unconditionally and on a block-by-block basis. There is no need 
for superblocks. 

The remainder of this section proves Theorem 2.4. 

Proof: 

1) (a): It is easy to see that $R_\infty^{\overleftarrow{X}}(d) = \lim_{n\to\infty} R_n^{\overleftarrow{X}|Q_\Delta(\overleftarrow{X}_0)}(d)$ since the endpoint $\overleftarrow{X}_0$ is distributed like 
$-\sum_{i=1}^{n}\lambda^{-i}W_i$ and has a finite $\eta$-th moment by assumption. By a lemma in the Appendix, the entropy 
of $Q_\Delta(\overleftarrow{X}_0)$ is bounded by a constant that depends only on the precision $\Delta$. This finite number is then 
amortized away as $n \to \infty$. 

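2) (b): Next, we show 

$$\lim_{\Delta\to 0} R_\infty^{\overleftarrow{X}|Q_\Delta(\overleftarrow{X}_0)}(d) = \lim_{n\to\infty} R_n^{\overleftarrow{X}|\overleftarrow{X}_0}(d). \tag{20}$$

For notational convenience, let $Z^q = Q_\Delta(\overleftarrow{X}_0)$. First, $R_n^{\overleftarrow{X}|\overleftarrow{X}_0}(d)$ is immediately bounded above by 
$R_n^{\overleftarrow{X}|Z^q}(d)$ since knowledge of $\overleftarrow{X}_0$ exactly is better than knowledge of only the quantized $Z^q$. To get a 
lower bound, imagine a hypothetical problem that is one time-step longer and consider the choice between 
knowing $\overleftarrow{X}_0$ to fine precision $\Delta$ or knowing $\overleftarrow{X}_{-1}$ exactly. 

$$R_n^{\overleftarrow{X}_0^{n-1}|\overleftarrow{X}_{-1}}(d) \ge_{(i)} R_n^{\overleftarrow{X}_0^{n-1}|\overleftarrow{X}_{-1}, Z^q}(d) \ge_{(ii)} R_n^{\overleftarrow{X}_0^{n-1}|\overleftarrow{X}_{-1}, Z^q, C_\gamma, G_\delta, W'_\delta}(d)$$

where (i) and (ii) above hold since added conditioning can only reduce the conditional rate-distortion 
function, and $C_\gamma$, $G_\delta$, $W'_\delta$ are from the following lemma applied to the hypothesized $W_{-1}$ driving noise. 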

Lemma 4.1: Given a random variable $W$ with density $f_W$, arbitrary $1 > \gamma > 0$, there exists a $\delta > 0$ so 
that it is possible to realize $W$ as 

$$W = (1 - C_\gamma)(G_\delta + U_\delta) + C_\gamma W'_\delta \tag{21}$$

where 

• $C_\gamma$ is a Bernoulli random variable with probability $\gamma$ of being 1. 

• $U_\delta$ is a continuous uniform random variable on $[-\frac{\delta}{2}, +\frac{\delta}{2}]$. 

• $G_\delta$ and $W'_\delta$ are some random variables whose distributions depend on $f_W$, $\delta$, $\gamma$. 

• $C_\gamma$, $U_\delta$, $G_\delta$, $W'_\delta$ are all independent of each other. 
Proof: See Appendix HI 



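Pick $\gamma$ small and then choose $\Delta \ll \delta$. Notice that $\overleftarrow{X}_{-1} = \lambda^{-1}\overleftarrow{X}_0 - \lambda^{-1}[(1 - C_\gamma)(G_\delta + U_\delta) + C_\gamma W'_\delta]$ 
where $C_\gamma$, $U_\delta$, $G_\delta$, $W'_\delta$ are independent of each other as well as the entire vector $\overleftarrow{X}_0^{n-1}$. Because the $\{\overleftarrow{X}_t\}$ 
process is Markov, the impact of the observations $\overleftarrow{X}_{-1}$, $Z^q$, $C_\gamma$, $G_\delta$, $W'_\delta$ on the conditional rate-distortion 
function is factored entirely through the posterior distribution for $\overleftarrow{X}_0$. 

There are two cases: 

• $C_\gamma = 1$: The value for $\overleftarrow{X}_0$ is entirely revealed by the observations. The posterior is a Dirac delta. 

• $C_\gamma = 0$: There are two independent measurements of $\overleftarrow{X}_0$. The first is the quantization $Z^q$. The second 
is $\lambda\overleftarrow{X}_{-1} + G_\delta = \overleftarrow{X}_0 - U_\delta$. This is just $\overleftarrow{X}_0$ blurred by uniform noise. 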

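It is useful to view them as coming one after the other. After seeing $Z^q = Q_\Delta(\overleftarrow{X}_0) = z_1$, the posterior 
distribution $\mathcal{P}(\overleftarrow{X}_0|Z^q = z_1)$ has support only within $[z_1 - \frac{\Delta}{2}, z_1 + \frac{\Delta}{2}]$. 

The distribution $\mathcal{P}(Z_2|Z^q = z_1)$ for the second observation $Z_2 = \overleftarrow{X}_0 - U_\delta$ conditioned on the first 
observation has a pair of interesting properties. First, it has support only on $[z_1 - \frac{\delta + \Delta}{2}, z_1 + \frac{\delta + \Delta}{2}]$. 
Second, the distribution is uniform over the interval $(z_1 - \frac{\delta - \Delta}{2}, z_1 + \frac{\delta - \Delta}{2})$ since the $\mathcal{P}(\overleftarrow{X}_0|Z^q = z_1)$ 
has support with total span $\Delta \ll \delta$. 

Consider the posterior $\mathcal{P}(\overleftarrow{X}_0|Z^q = z_1, Z_2 = z_2)$ for $z_2 \in (z_1 - \frac{\delta - \Delta}{2}, z_1 + \frac{\delta - \Delta}{2})$ and apply Bayes rule: 

$$\mathcal{P}(\overleftarrow{X}_0 \le x|Z^q = z_1, Z_2 = z_2) = \frac{d\mathcal{P}(\overleftarrow{X}_0 \le x, Z_2 = z_2|Z^q = z_1)}{d\mathcal{P}(Z_2 = z_2|Z^q = z_1)} = \frac{d\mathcal{P}(Z_2 = z_2|Z^q = z_1, \overleftarrow{X}_0 \le x)\,\mathcal{P}(\overleftarrow{X}_0 \le x|Z^q = z_1)}{d\mathcal{P}(Z_2 = z_2|Z^q = z_1)} = \mathcal{P}(\overleftarrow{X}_0 \le x|Z^q = z_1).$$

The last equality holds because, for $z_2$ in this interval, the conditional density of $Z_2$ equals $\frac{1}{\delta}$ regardless 
of where $\overleftarrow{X}_0$ lies within its support. 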

So if it lands in this region, the second observation is useless. Notice that $U_\delta \in (-\frac{\delta - 2\Delta}{2}, \frac{\delta - 2\Delta}{2})$ forces 
the second observation to be inside this region. Thus the second observation is useless with probability 
at least $(1 - \gamma)\frac{\delta - 2\Delta}{\delta}$ regardless of what the actual $\overleftarrow{X}_0$ is. 

Define a new hypothetical observation $Z'$ that with probability $(1 - \gamma)\frac{\delta - 2\Delta}{\delta}$ is just equal to $Z^q$ and is 
equal to $\overleftarrow{X}_0$ otherwise. The above tells us that this is a more powerful observation than the original 
$(\overleftarrow{X}_{-1}, Z^q, C_\gamma, G_\delta, W'_\delta)$. Thus 



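$$R_n^{\overleftarrow{X}|\overleftarrow{X}_{-1}, Z^q, C_\gamma, G_\delta, W'_\delta}(d) \ge (1 - \gamma)\frac{\delta - 2\Delta}{\delta}R_n^{\overleftarrow{X}|Z^q}(d) + \left(1 - (1 - \gamma)\frac{\delta - 2\Delta}{\delta}\right)R_n^{\overleftarrow{X}|\overleftarrow{X}_0}(d) \ge (1 - \gamma)\frac{\delta - 2\Delta}{\delta}R_n^{\overleftarrow{X}|Z^q}(d).$$

Simple algebra then reveals that 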



n 



< 



n(l-7)<y-2A 



n(l-7)<y-2A' 

Taking the limits of $n \to \infty$, $\Delta \to 0$, $\delta \to 0$, $\gamma \to 0$ establishes the desired result. 
Notice that an identical argument works to show that 

$$\lim_{\Delta\to 0}\lim_{n\to\infty} R_n^{\vec{X}|Z^q}(d) = \lim_{n\to\infty} R_n^{\vec{X}|\vec{X}_n}(d)$$

for the forward unstable process. It does not matter if it is conditioned on the exact endpoint or a finely 
quantized version of it. Notice also that the argument is unchanged if the quantization was dithered rather 
than undithered. 

3) (c): This follows almost immediately from (18) from Proposition 3.2. The only remaining task is 
to show that conditioning on the endpoint gives the same rate-distortion function for the forward and 
the backward histories. 

It is clear that the iid $\{Z_k\}$ in the "conditional history" problem are just scaled-down 
versions of the $\{\tilde{W}_k\}$ from the "endpoints" problem. The forward $\vec{X}_k = (\vec{X}_{k,1}, \ldots, \vec{X}_{k,n-1})$ can 
be recovered using a simple translation of $\overleftarrow{X}_k$ by the vector $(Z_k, \lambda Z_k, \ldots, \lambda^{n-1}Z_k)$ since 

$$X_t = \sum_{i=0}^{t-1}\lambda^{t-i-1}W_i = \sum_{i=0}^{n-1}\lambda^{t-i-1}W_i - \sum_{i=t}^{n-1}\lambda^{t-i-1}W_i = \lambda^t\sum_{i=0}^{n-1}\lambda^{-i-1}W_i - \sum_{i=0}^{n-1-t}\lambda^{-i-1}W_{t+i}.$$

Similarly, the conditional history problem can be recovered from the forward one by another simple 
translation of $\vec{X}_k$ by the vector $(-\lambda^{-(n-1)}Z_k, \ldots, -\lambda^{-1}Z_k, -Z_k)$. 

Thus, the problem of encoding the conditional history to distortion $d$ conditioned on its endpoints is 
the same whether we are considering the unstable forward or stable backwards processes. 
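
The algebraic identity behind this translation can be checked exactly on a random noise sequence. The sketch below builds a forward block started from zero and a backward block from the recursion $\overleftarrow{X}_t = \lambda^{-1}\overleftarrow{X}_{t+1} - \lambda^{-1}W_t$; the zero terminal condition for the backward block, the variable names, and the noise values are assumptions made for illustration.

```python
from fractions import Fraction
import random


def forward_block(noise, lam):
    """x[t+1] = lam * x[t] + w[t], started from x[0] = 0."""
    x = [Fraction(0)]
    for w in noise:
        x.append(lam * x[-1] + w)
    return x


def backward_block(noise, lam):
    """xb[t] = (xb[t+1] - w[t]) / lam, with terminal value xb[n] = 0 (an assumption)."""
    n = len(noise)
    xb = [Fraction(0)] * (n + 1)
    for t in range(n - 1, -1, -1):
        xb[t] = (xb[t + 1] - noise[t]) / lam
    return xb


def scaled_noise_sum(noise, lam):
    """sum_{i=0}^{n-1} lam**(-i-1) * w[i], the common term multiplying lam**t."""
    return sum(w / lam ** (i + 1) for i, w in enumerate(noise))


rng = random.Random(1)
lam = Fraction(2)
noise = [Fraction(rng.randint(-500, 500), 1000) for _ in range(12)]

x = forward_block(noise, lam)
xb = backward_block(noise, lam)
s = scaled_noise_sum(noise, lam)

# Forward value = backward value shifted by lam**t times a single scalar.
translation_holds = all(x[t] == xb[t] + lam ** t * s for t in range(len(noise) + 1))

if __name__ == "__main__":
    print("translation identity holds:", translation_holds)
    print("largest |backward| value:", float(max(abs(v) for v in xb)))
    print("largest |forward| value:", float(max(abs(v) for v in x)))
```

The forward values grow like $\lambda^t$ while the backward values stay bounded, yet the two blocks differ only by a known multiple of one scalar, which is why a difference distortion measure treats them identically once that scalar is available as side information.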

4) Phase transition: At rates strictly less than $\log_2\lambda$, the distortion for the original $X$ process is 
necessarily infinite. This is shown in Lemma 6.2 where finite distortion implies the ability to carry 
$\approx \log_2\lambda$ bits through the communication medium. □ 



V. Quality of service requirements for communicating unstable processes: sufficiency 

In Section V-A, the sense of anytime reliability is reviewed from [5] and related to classical results on 
sequential coding for noisy channels. Then in Section V-B, anytime reliable communication is shown to 
be sufficient for protecting the encoding of the checkpoint process, thereby proving Theorem 2.2. Finally 
in Section V-C, it is shown that it is sufficient to communicate the historical information using traditional 
Shannon $\epsilon$-reliability, thereby proving Theorem 2.3. 

Fig. 6. A channel encoder viewed as a tree. At every integer time, each path of the tree has a channel input symbol. The path taken down 
the tree is determined by the message bits to be sent. Infinite trees have no intrinsic target delay and bit/path estimates can get better as 
time goes on. 



A. Anytime reliability 

It should be clear that the encoding given for the checkpoint process in Section III-B is very sensitive 
to bit errors since it is decoded recursively in a way that propagates errors in an unbounded fashion. To 
block this propagation of errors, the channel code must guarantee not only that every bit eventually is 
received correctly, but that this happens fast enough. This is what motivates the definition of anytime 
reliability given in Definition 2.6. The relationship of anytime reliability to classical concepts of error 
exponents as well as bounds are given in [7], [5]. 

Here, the focus is on the case where there is no explicit feedback of channel outputs. Consider 
maximum-likelihood decoding [26] or sequential-decoding [28] as applied to an infinite tree code like the one 
illustrated in Fig. 6. The estimates $\hat{B}_i(t)$ describe the current estimate for the most likely path through the 
tree based on the channel outputs received so far. Because of the possibility of "backing up," in principle 
the estimate for $B_i$ could change at any point in time. The theory of both ML and sequential decoding 
tells us that generically, the probability of bit error on bit $i$ approaches zero exponentially with increasing 
delay. 

In traditional analysis, random ensembles of infinite tree codes are viewed as idealizations used to study 
the asymptotic behavior of finite sequential encoding schemes such as convolutional codes. We can instead 
interpret the traditional analysis as telling us that random infinite tree codes achieve anytime reliability. In 
particular, we know from the analysis of [26] that at rate R bits per channel use, we can achieve anytime 
reliability $\alpha$ equal to the block random coding error exponent. Pinsker's argument in [29] as generalized 
in [7] tells us also that we cannot hope to do any better, at least in the high-rate regime for symmetric 
channels. We summarize this interpretation in the following theorem: 

Theorem 5.1 (Random anytime codes exist for all DMCs): For a stationary discrete memoryless channel 
(DMC) with capacity $C$, randomized anytime codes exist without feedback at all rates $R < C$ and have 
anytime reliability $\alpha = E_r(R)$ where $E_r(R)$ is the random coding error exponent as calculated in base 2. 

Proof: See Appendix IV. 

B. Sufficiency for the checkpoint process 

The effect of any bit error in the checkpoint encoding of Section III-B will be to throw us into a wrong 
bin of size $\Delta$. This bin can be at most $\lambda^n(\Delta + \frac{\Omega}{\lambda - 1})$ away from the true bin. The error will then propagate and 
grow by a factor $\lambda^n$ as we move from checkpoint to checkpoint. 

If we are interested in the $\eta$-difference distortion, then the distortion is growing by a factor of $\lambda^{n\eta}$ 
per checkpoint, or a factor of $\lambda^\eta$ per unit of time. As long as the probability of error on the message bits 
goes down faster than that, the expected distortion will be small. This parallels Theorem 4.1 in [5] and 
results in this proof for Theorem 2.2. 

Proof: Let $\hat{X}_{kn}(\phi)$ be the best estimate of the checkpoint $X_{kn}$ at time $kn + \phi$. By the anytime reliability 
property, grouping the message bits into groups of $nR_1$ at a time, and the nature of exponentials, it is 
easy to see that there exists a constant $K'$ so that: 

E[\X' kn {<t>) ~ Xkn\ V ] < 



< 



where $K'''$ is a constant that depends on the anytime code, rate $R_1$, support $\Omega$, and unstable $\lambda$. Thus by 
making sure $\alpha > \eta\log_2\lambda$ and choosing $\phi$ large enough, $2^{-\alpha\phi}$ will become small enough so that $K'''2^{-\alpha\phi}$ 
is as small as we like and the checkpoints will be reconstructed to arbitrarily high fidelity. □ 
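
The role of the condition $\alpha > \eta\log_2\lambda$ can be illustrated with a simplified accounting that is not the proof's exact bound. Assume, purely for illustration, that the bits describing the checkpoint $j$ blocks in the past are wrong with probability $2^{-\alpha(\phi + jn)}$ and that such an error has been amplified by $\lambda^{\eta jn}$ in the $\eta$-th power sense; all other constants are dropped, and the function names are hypothetical.

```python
import math


def geometric_bound(alpha, eta, lam, n, phi):
    """Closed form of sum_j 2**(-alpha*(phi + j*n)) * lam**(eta*j*n), or inf if it diverges."""
    gap = alpha - eta * math.log2(lam)
    if gap <= 0:
        return math.inf
    return 2.0 ** (-alpha * phi) / (1.0 - 2.0 ** (-n * gap))


def partial_sum(alpha, eta, lam, n, phi, terms):
    """First `terms` terms of the same sum, with exponents combined to avoid overflow."""
    log_lam = math.log2(lam)
    return sum(2.0 ** (-alpha * (phi + j * n) + eta * log_lam * j * n)
               for j in range(terms))


if __name__ == "__main__":
    eta, lam, n = 2.0, 2.0, 4
    for alpha in (1.5, 2.0, 2.5, 3.0):
        for phi in (10, 20):
            print(f"alpha={alpha}  phi={phi}  "
                  f"60-term sum={partial_sum(alpha, eta, lam, n, phi, 60):.3e}  "
                  f"closed form={geometric_bound(alpha, eta, lam, n, phi):.3e}")
```

Under these assumptions the sum is finite exactly when $\alpha > \eta\log_2\lambda$, and in that case it is a constant multiple of $2^{-\alpha\phi}$, matching the form $K'''2^{-\alpha\phi}$ in the text.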

Theorem 2.2 applies even in the case that $\lambda = 1$ and hence answers the question posed by Berger 
in [12] regarding the ability to track an unstable process over a noisy channel without perfect feedback. 
Theorem 5.1 tells us that it is in principle possible to get anytime reliability without any feedback at all, 
and thus also with only noisy feedback. 

This idea of tracking an unstable process using an anytime code is useful beyond the source-coding 
context. In [30], [31], [32], anytime codes are used over a noisy feedback link to study the reliability 
functions for communication using ARQ schemes and expected delay. The sequence numbers of blocks 
are considered to be an unstable process that needs to be tracked at the encoder. The random requests for 
retransmissions make it behave like a random walk with a forward drift, but that can stop and wait from 
time to time. 

C. Sufficiency for the history process 

It is easy to see that the history information for the two stream code does not propagate errors from 
superblock to superblock and so does not require any special QoS beyond what one would need for an 
iid or stationary-ergodic process. This is the basis for proving Theorem 2.3. 

Proof: Since the impact of a bit error is felt only within the superblock, no propagation of errors 
needs to be considered. Theorem 2.4 tells us that there is a maximum possible distortion on the historical 
component. Thus the standard achievability argument [8] for $D(R)$ tells us that as long as the probability 
of block error can be made arbitrarily small ($\epsilon$) with increasing block-length, then the additional expected 
distortion induced by decoding errors will also be arbitrarily small. The desired probability of bit error 
can then be set to be $\epsilon$ divided by the superblock length. □ 

The curious fact here is that the QoS requirements of the second stream of messages only need to hold 
on a superblock-by-superblock basis. To achieve a small ensemble average distortion, there is no need 
to have a secondary bitstream available with error probability that gets arbitrarily small with increased 
delay! The secondary channel could be nonergodic and go into outage for the entire semi-infinite length 
of time as long as that outage event occurs sufficiently rarely so that the average on each superblock is 
kept small. Thus the second stream of messages is compatible with the approach put forth in [33]. 

VI. Quality of service requirements for communicating unstable processes: necessity 

The goal is to prove Theorem 2.5 by showing that unstable Markov processes require communication 
channels capable of supporting two-tiered service: a high priority core of rate $\log_2\lambda$ with anytime-reliability 
of at least $\eta\log_2\lambda$, and the rest with Shannon reliable bit-transport. To do this, this section proceeds in 
stages and follows the asymptotic equivalence approach of [5]. 

This section builds on Section IV-A where the pair of communication problems (the endpoint 
communication problem and conditional history communication problem) were introduced. In Section VI-A, 
it is shown that the anytime-reliable bit-transport problem reduces to the first problem (endpoint 
communication) in the pair. Then Section VI-B finishes the necessity argument by showing how traditional 
Shannon-reliable bit-transport reduces to the second problem and that the two of them can be put together. 
This reduces a pair of data-communication problems (anytime-reliable bit-transport and Shannon-reliable 
bit-transport) to the original problem of communicating a single unstable process to the desired fidelity. 

The proof construction is illustrated in Fig. 7. Two message streams need to be embedded: a priority
stream that requires anytime reliability and a remaining stream for which Shannon-reliability is good
enough. The priority stream is used to generate the endpoints while the history part is filled in with
the appropriate conditional distribution. This simulated process is then run through the joint source-channel
encoder $\mathcal{E}_s$ to generate channel inputs. The channel outputs are given to the joint source-channel decoder
$\mathcal{D}_s$ which produces, after some delay $\phi$, a fidelity $d$ reconstruction of the simulated unstable process. By
looking at the reconstructions corresponding to the endpoints, it is possible to recover the priority message
bits in an anytime reliable fashion. With these in hand, the remaining stream can also be extracted from
the historical reconstructions.

A. Necessity of anytime reliability 

We follow the spirit of information embedding [34] except that we have no a-priori covertext. Instead we
use a simulated unstable process that uses common randomness and, without loss of generality, message
bits assumed to come from iid fair coin tosses. If the message bits were not fair coin tosses to begin with, XOR
them with a one-time pad using common randomness before embedding them. This section parallels the
necessity story in [5], except that in this context, there is the additional complication of having a specified
distribution for the $\{W_t\}$, not just a bound on the allowed $|W_t|$.

The result is proved in stages. First, we assume that the driving noise $W$ is a continuous uniform random
variable plus something independent. After that, this assumption is relaxed to having a Riemann-integrable
density $f_W$.

1) Uniform driving noise: 

Lemma 6.1: Assume the driving noise $W = G + U_\delta$ where $G, U_\delta$ are independent random variables
with $U_\delta$ being a uniform random variable on the interval $[-\frac{\delta}{2}, +\frac{\delta}{2}]$ for some $\delta > 0$.

If a joint source-channel encoder/decoder pair exists for the endpoint process given by (11) that achieves
© for every position $kn$, then for every rational rate $R < \log_2 \lambda$, there exists a randomized anytime
code for the channel that achieves an anytime reliability of $\alpha = \eta \log_2 \lambda$.

Proof: The goal is to simulate the endpoint process using the message bits and then to recover the
message bits from the reconstructions of the endpoints. Pick the initial condition $X_0$ using common
randomness so it can be ignored in what follows.
Fig. 7. Turning a joint-source-channel code into a two-stream code using information embedding. The good joint-source-channel code is
like an attacker that will not impose too much distortion. Our goal is to simulate a source that carries our messages so that they can be 
recovered from the attacker's output. 



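At the encoder, the goal is to simulate

$$\begin{aligned}
\tilde{W}_k &= \lambda^{n-1} \sum_{i=0}^{n-1} \lambda^{-i} W_{k,i} \\
&= \lambda^{n-1} W_{k,0} + \lambda^{n-1} \sum_{i=1}^{n-1} \lambda^{-i} W_{k,i} \\
&= U_{\lambda^{n-1}\delta,k} + \Big[\lambda^{n-1}\Big(G_k + \sum_{i=1}^{n-1} \lambda^{-i} W_{k,i}\Big)\Big]
\end{aligned}$$

where $W_{k,i}$ is shorthand for $W_{kn+i}$. The $[\lambda^{n-1}(G_k + \sum_{i=1}^{n-1} \lambda^{-i} W_{k,i})]$ term is simulated
entirely using common randomness and is hence known to both the transmitter and receiver. The
$U_{\lambda^{n-1}\delta,k}$ term is a uniform random variable on $[-\lambda^{n-1}\frac{\delta}{2}, +\lambda^{n-1}\frac{\delta}{2}]$
and is simulated using a combination of common randomness and the fair coin tosses coming from the
message bits.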
Fig. 8. The priority message bits are used to refine a point on a Cantor set. The natural tree structure of the Cantor set construction allows
us to encode bits sequentially. The Cantor set also has finite gaps between all points corresponding to bit sequences that first differ in a 
particular bit position. These gaps allow us to reliably extract bit values from noisy observations of the Cantor set point regardless of which 
point it is. 



Since a uniform random variable has a binary expansion that is fair coin tosses, we can write
$U_{\lambda^{n-1}\delta,k} = \lambda^{n-1}\frac{\delta}{2} \sum_{\ell=1}^{\infty} (\frac{1}{2})^{\ell} S_{k,\ell}$ where the $S_{k,\ell}$ are iid random
variables taking on values $\pm 1$ each with probability $\frac{1}{2}$.

The idea is to embed the iid $nR$ message bits into positions $\ell = 1, 2, \ldots, nR$ while letting the rest
(a uniform random variable $U'_{\delta 2^{-nR},k}$ representing the semi-infinite sequence of bits
$(S_{k,nR+1}, S_{k,nR+2}, \ldots)$) be chosen using common randomness. The result is:

$$\tilde{W}_k = \lambda^{n-1}\frac{\delta}{2} M_k + \Big[\lambda^{n-1}\Big(U'_{\delta 2^{-nR},k} + G_k + \sum_{i=1}^{n-1} \lambda^{-i} W_{k,i}\Big)\Big] \qquad (22)$$

where $M_k$ is the $nR$ bits of the message as represented by $2^{nR}$ equally likely points in the interval $[-1, +1]$
spaced apart by $2^{1-nR}$, and the rest of the terms $[\cdots]$ are chosen using common randomness known at
both the transmitter and receiver side.
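
The digit-splitting step can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, the common-randomness digits are truncated to a finite list, and the scale factor $\lambda^{n-1}\frac{\delta}{2}$ is left out so that only the normalized point in $[-1, +1]$ is shown.

```python
from itertools import product


def signed_digits_to_point(bits, first_position=1):
    """Return sum_l 2^{-l} S_l with S_l = +1 for bit 1 and S_l = -1 for bit 0.

    The first entry of `bits` sits at binary position `first_position`.
    """
    return sum((2 * b - 1) * 2.0 ** -(first_position + i) for i, b in enumerate(bits))


def message_points(nR):
    """All 2^{nR} values that the message point M_k can take, in increasing order."""
    return sorted(signed_digits_to_point(bits) for bits in product([0, 1], repeat=nR))


def embed(message_bits, pad_bits):
    """Split the expansion into the message point M_k and the remainder.

    `pad_bits` stands in for a finite prefix of the common-randomness digits
    S_{k,nR+1}, S_{k,nR+2}, ... (an assumption made for illustration only).
    """
    nR = len(message_bits)
    M = signed_digits_to_point(message_bits)
    remainder = signed_digits_to_point(pad_bits, first_position=nR + 1)
    return M, remainder
```

For example, `message_points(3)` returns eight points spaced $2^{1-3} = 0.25$ apart, and the remainder returned by `embed` always has magnitude below $2^{-nR}$, which is why it can be absorbed into the common-randomness term.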

Since the simulated endpoints process is a linear function of the $\{\tilde{W}_k\}$ and the distortion measure is
a difference distortion, it suffices to just consider the $\{X'_{kn}\}$ process representing the response to the
discrete messages $\{M_k\}$ alone. This has a zero initial condition and evolves like

$$X'_{(k+1)n} = \lambda^n X'_{kn} + \beta M_k \qquad (23)$$

where $\beta = \lambda^{n-1}\frac{\delta}{2}$. Expanding this recursion out as a sum gives

$$X'_{(k+1)n} = (\lambda^n)^k \beta \sum_{i=0}^{k} \lambda^{-ni} M_i. \qquad (24)$$

This looks like a generalized binary expansion in base $\lambda^n$ and therefore implies that the $X'$ process takes
values on a growing Cantor set (illustrated in Fig. 8 for $nR = 1$).
The key property is that there are gaps in the Cantor set: 

Property 6.1: If the rate $R < \log_2 \lambda + \frac{\log_2(1 - \lambda^{-n})}{n}$ and the message streams $M$ and $\bar{M}$ first differ at
position $j$ (message $M_j \neq \bar{M}_j$), then at time $k > j$, the encoded $X'_{kn}$ and $\bar{X}'_{kn}$ corresponding to $M_0^{k-1}$
and $\bar{M}_0^{k-1}$ respectively differ by at least:

$$|X'_{kn} - \bar{X}'_{kn}| \ge K \lambda^{n(k-j)} \qquad (25)$$

for some constant $K > 0$ that does not depend on the values of the message bits, $k$, or $j$.
Proof: See Appendix V.

In coding theory terms, Property 6.1 can be interpreted as an infinite Euclidean free-distance for the
code with the added information that the distance increases exponentially as $\lambda^{n(k-j)}$. Thus, a bit error can
only happen if the received "codeword" is more than half the minimum distance away.
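
The gap structure can be checked by brute force on small cases. The Python sketch below is illustrative only: it runs the recursion (23) for every pair of short message streams and compares the worst normalized separation against a crude constant. The helper names are hypothetical, and `gap_constant` is a bound derived for this sketch (spacing $2^{1-nR}$ at the first differing position minus the largest possible contribution of all later positions); it is not the constant from the paper's appendix and it is positive only when $2^{nR} < \lambda^n - 1$.

```python
from itertools import product


def message_points(nR):
    """The 2^{nR} points sum_{l=1}^{nR} 2^{-l} S_l with S_l = +/-1."""
    return sorted(
        sum(s * 2.0 ** -(l + 1) for l, s in enumerate(signs))
        for signs in product([-1, 1], repeat=nR)
    )


def endpoint_response(msgs, lam, n, beta):
    """Run X'_{(k+1)n} = lam^n X'_{kn} + beta * M_k from a zero initial condition."""
    x = 0.0
    for m in msgs:
        x = lam ** n * x + beta * m
    return x


def gap_constant(lam, n, nR, beta):
    """A crude admissible K for this sketch (assumes 2^{nR} < lam^n - 1)."""
    return beta * (2.0 ** (1 - nR) - 2.0 / (lam ** n - 1)) / lam ** n


def worst_normalized_gap(lam, n, nR, beta, k):
    """Min over stream pairs of |X'_{kn} - Xbar'_{kn}| / lam^{n(k-j)}, j = first difference."""
    streams = list(product(message_points(nR), repeat=k))
    worst = float("inf")
    for a in streams:
        for b in streams:
            if a >= b:  # skip identical streams and mirrored pairs
                continue
            j = next(i for i in range(k) if a[i] != b[i])
            gap = abs(endpoint_response(a, lam, n, beta) - endpoint_response(b, lam, n, beta))
            worst = min(worst, gap / lam ** (n * (k - j)))
    return worst
```

With, say, `lam=2.0, n=2, nR=1, beta=1.0, k=5`, the worst normalized gap stays above `gap_constant(2.0, 2, 1, 1.0)`, consistent with a separation that grows like $\lambda^{n(k-j)}$.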

At the decoder, the common randomness means that the estimation error $X_{kn} - \hat{X}_{kn}$ is the error in
estimating $X'_{kn}$. By applying Markov's inequality to this using ©, we immediately get a bound on the
probability of an error on the prefix $M_0^i$ for $i < k$:

$$\begin{aligned}
P(\hat{M}_0^i(kn) \neq M_0^i) &\le P\Big(|X'_{kn} - \hat{X}'_{kn}| > \frac{K}{2}\lambda^{n(k-i)}\Big) \\
&= P\Big(|X_{kn} - \hat{X}_{kn}| > \frac{K}{2}\lambda^{n(k-i)}\Big) \\
&= P\Big(|X_{kn} - \hat{X}_{kn}|^{\eta} > \Big(\frac{K}{2}\Big)^{\eta}\lambda^{\eta n(k-i)}\Big) \\
&\le K' 2^{-(\eta \log_2 \lambda) n(k-i)}.
\end{aligned}$$


But $n(k - i)$ is the delay that is experienced at the $nR$-bit message level. If bits have to be buffered up to
form messages, then the delay at the bit level includes another constant $n$. This only increases the constant
$K'$ further but does not change the exponent with large delays. Thus, the desired anytime reliability is
obtained. □

2) General driving noise: Lemma 6.1 can have the technical smoothness condition weakened to simply
requiring a Riemann-integrable density for the white $W$ driving process.

Lemma 6.2: Assume the driving noise $W$ has a Riemann-integrable density $f_W$. If there exists a family
of joint source-channel encoder/decoder pairs for a sequence of increasing $n$-endpoint problems given
by (11) that achieve © for every position $kn$, then for every rate $R < \log_2 \lambda$ and anytime reliability
$\alpha < \eta \log_2 \lambda$, there exists a randomized anytime code for the underlying channel.

Proof: Since the density is Riemann-integrable, Lemma 4.1 applies. Choose $\delta$ such that $\gamma < \lambda^{-2\eta n}$.
When simulating $W_{k,0}$ in the endpoint process, use common randomness for $C_\gamma$ and $W''$, and follow the
procedure from the proof of Lemma 6.1 for $G_\delta$ and $U_\delta$.

We can thus interpret a "heads" for $C_\gamma$ as an "erasure" with probability $\gamma$ since no message can be
encoded in that time period. From the point of view of Lemma 6.1, this can be considered a known null
message.

Since the outcomes of these coin tosses come from common randomness, the positions of these erasures
are known to both the transmitter and the receiver. In this way, it behaves like a packet erasure channel
with feedback. This problem is studied in Theorem 3.3 of [7], and the delay-optimal coding strategy
relative to the erasure channel is to place incoming packets into a FIFO queue awaiting a non-erased
opportunity for transmission. The following lemma summarizes the results needed from [7].

Lemma 6.3: Suppose packets arrive deterministically at a rate of $R$ packets per unit time and enter a
FIFO queue drained at constant rate 1 per unit time.

• Suppose $\gamma$ is sufficiently small. If each packet has a size distribution that is bounded below a geometric $(1 - \gamma)$
(i.e. $P(\text{Size} > s) \le \gamma^s$ for all non-negative integers $s$), then the random delay $\phi$ experienced by
any individual packet from arrival to departure from the queue satisfies $P(\phi > s) \le K 2^{-\alpha s}$ for all
non-negative $s$ and some constant $K$ that does not depend on $s$. Furthermore, if $R < \frac{1}{1+2r}$ for some
$r > 0$, then $\alpha \ge -\log_2 \gamma - 2\gamma^{r}$.

• Assume the rate $R = \frac{1}{n}$ and each packet has a size distribution that is bounded by:
$P(\text{Size} > n(1 - \epsilon) + s) \le \gamma^s$ for all non-negative integers $s$. Then the delay $\phi$ experienced by any individual packet
has a tail distribution bounded in the same way as for $R' = \frac{1}{n\epsilon}$ and packets with geometric $(1 - \gamma)$
size. That is, $P(\phi > s) \le K 2^{-\alpha s}$ where $\alpha \ge -\log_2 \gamma - 2\gamma^{r}$ for any $r > 0$ with $R' < \frac{1}{1+2r}$.

Proof: See Theorem 3.3 and Corollary 6.1 of [7].

For our problem, the message bits are arriving deterministically at bit-rate $R < \log_2 \lambda$ per unit time
to the transmitter. Pick $r > 0$ small enough so that $R' = (1 + 3r)R < \log_2 \lambda$. Group message bits into
packets of size $nR'$. These packets arrive deterministically at rate $\frac{1}{1+3r} < \frac{1}{1+2r}$ packets per $n$ time units.

Thus, Lemma 6.3 applies and the delay (in $n$ units) experienced by a packet in the queue has a delay
error exponent $\alpha$ of at least

$$-\log_2 \gamma - 2\gamma^{r} > -\log_2 \lambda^{-2\eta n} - 2\lambda^{-2\eta n r} = n 2\eta \log_2 \lambda - 2\lambda^{-2\eta n r}$$

per $n$ time steps or $2\eta \log_2 \lambda - \frac{2}{n}\lambda^{-2\eta n r}$ per unit time step. When $n$ is large, this exponent is much faster
than the delay exponent of $\eta \log_2 \lambda$ obtained in the proof of Lemma 6.1. The two delays experienced by
a bit are independent by construction. Thus, the dominant delay-exponent remains $\eta \log_2 \lambda$ as desired. □

Notice that the simulated endpoint process depends only on common randomness and the message 
packets. Since the common randomness is known perfectly at the receiver by assumption and the message 
packets are known with a probability that tends to 1 with delay, the endpoint process is also known with 
zero distortion with a probability tending to 1 as the delay increases. 
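
The erasure-with-feedback picture used in the proof of Lemma 6.2 can be explored numerically. The following is a minimal Python sketch and not the analysis of [7]: it assumes time is counted in slots (one slot standing for a block of $n$ steps), that a non-erased slot carries exactly one packet, and that packets arrive every `period` slots. All function and parameter names are hypothetical.

```python
import random
from collections import deque


def simulate_erasure_queue(gamma, period, num_packets, seed=0):
    """FIFO queue in front of a slotted erasure channel with known erasures.

    Packets arrive every `period` slots; each slot is erased with probability
    `gamma` (requires gamma < 1). Returns per-packet delays in slots, where 0
    means the packet was sent in its arrival slot.
    """
    rng = random.Random(seed)
    waiting = deque()  # arrival slots of queued packets, oldest first
    delays = []
    slot = 0
    arrived = 0
    while len(delays) < num_packets:
        if slot % period == 0 and arrived < num_packets:
            waiting.append(slot)
            arrived += 1
        erased = rng.random() < gamma
        if waiting and not erased:
            delays.append(slot - waiting.popleft())
        slot += 1
    return delays


def empirical_tail(delays, s_max):
    """Empirical estimate of P(delay > s) for s = 0, ..., s_max."""
    return [sum(d > s for d in delays) / len(delays) for s in range(s_max + 1)]
```

Plotting `empirical_tail(simulate_erasure_queue(0.1, 2, 100000), 20)` on a log scale gives a rough feel for the exponential decay of the delay tail that Lemma 6.3 quantifies; the sketch makes no claim about matching the exponent stated there.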



B. Embedding classical bits 

All that remains is to embed the classical message bits into the historical process. The overall
construction is described in Fig. 7. First, $n$ is chosen to be large enough so that the $R_1$ stream can be successfully
embedded in the endpoint process by Lemma 6.2.

Now, $n$ is further increased so that $R_2 < R_n^{X|Z}(d)$, the conditional rate-distortion function for the
history given the endpoint. This can be done since
$\lim_{n \to \infty} R_n^{X|Z}(d) = R^{X}(d) - \log_2 \lambda$ by Theorem El
By choosing an appropriate additional delay, Lemma 6.2 assures us that the receiver will know all the
past high-priority messages and hence simulated endpoints correctly with an arbitrarily small probability
of error $\epsilon$. As described in Section IV-A, this means we now have a family of systems (indexed by $m$)
that solve the conditional history problem. The condition © translates into



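$$\limsup_{m \to \infty} P\Big(\frac{1}{m} \sum_{k=\tau}^{\tau+m-1} \frac{1}{n} \sum_{i=1}^{n-1} |X_{(k,i)} - \hat{X}_{(k,i)}|^{\eta} > d\Big) = 0. \qquad (26)$$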



It tells us that by picking m large enough, the probability of having excess distortion can be made as 
small as desired. 

The simulated {Zk} containing the high-priority messages are interpreted as the "coverstory" that must 
be respected when embedding messages into the {X^} process. The {Zk} are iid by construction and 
hence Theorem 3 from [18] (full proofs in [19]) applies and tells us that a length $m' > m$ random
code with drawn independently of each other, but conditional on the iid Z k , can be used to embed 

X I Z X I X 

information at any rate nR 2 < nR n [d+ e) — nRn n (d + e) per vector symbol with arbitrarily low 
probability of error. □ 

The "weak law of large numbers"-like condition (8), or something like it, is required for the theorem
to hold since there are joint source-channel codes for which mutual information cannot be turned into the
reliable communication of bits at arbitrarily low probabilities of error. Consider the following contrived
example. Suppose there are two different joint source-channel codes available: one has a target distortion
of $d_1$ and the other has a target distortion of $d_2 = 10 d_1$. The actual joint code, which is presumed to have
access to common randomness, could decide with some small probability to use the second code rather than the
first. In such a case, the ensemble average mutual information is close to $R(d_1) - \log_2 \lambda$ bits, but with
non-vanishing probability we might not be able to sustain such a rate over the virtual channel.

We conjecture that for DMCs, if any joint source-channel code exists that hits the target distortion on 
average, then one should also exist that meets © and it should be possible to simultaneously communicate 
two streams of messages reliably with anytime reliability on the first stream and enough residual rate on 
the second. 
VII. Unstable Gauss-Markov processes with squared-error distortion
A. Source-coding for Gaussian processes 

The goal here is to prove Corollary 2.1. The strategy is essentially as before. One simplification is
that we can make full use of Theorem 2.4 and rely on $R^{\overleftarrow{X}}(d) = R^{X}(d) - \log_2 \lambda$. There is thus no rate
loss in encoding the historical segments on a block-by-block basis rather than using superblocks and
conditional encodings. The only issue that remains is dealing with the unbounded support when encoding
the checkpoints.

The overall approach is: (key differences italicized) 

(a) Look at time in blocks of size $n$ and encode the values of checkpoints $X_{kn}$ recursively to very high
precision using a prefix-free variable-length code with rate $n(\log_2 \lambda + \epsilon_1) + L_k$ bits per value, where
the $L_k$ are iid random variables with appropriately nice properties.

(b) Smooth out the variable-length code by running it through a FIFO queue drained at constant rate
$R_1 = \log_2 \lambda + \epsilon_1 + \epsilon_q$. Make sure that the delay exponent in the queue is high enough.

(c) Use the exact value for the ending checkpoint $X_{(k+1)n}$ (instead of the quantized $\hat{X}_{(k+1)n}$) to transform the
segment immediately before it so that it looks exactly like a stable backwards Gaussian process of
length $n$ with initial condition 0. Encode each block of the backwards history process to average-
fidelity $d$ using a fixed-rate rate-distortion code for the backwards process that operates at a rate
slightly above $R^{\overleftarrow{X}}(d)$.

(d) At the decoder, wait $\phi$ time units and attempt to decode the checkpoints to high fidelity. If the FIFO
queue is running too far behind, then extrapolate a reconstruction based on the last fully decoded
checkpoint.

(e) Decode the history process to average-fidelity $d$ and combine it with the recursively quantized
checkpoints to get the reconstruction.

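a) Encoding the checkpoints: (11) remains valid, but the term $\tilde{W}_k = \lambda^{n-1} \sum_{i=0}^{n-1} \lambda^{-i} W_{kn+i}$ is not
bounded since the $W_i$ are iid Gaussians. The $\tilde{W}_k$ are instead Gaussian with variance

$$\tilde{\sigma}^2 = \lambda^{2(n-1)} \sum_{i=0}^{n-1} \lambda^{-2i} \sigma^2 \le \lambda^{2(n-1)} \sigma^2 \sum_{i=0}^{\infty} \lambda^{-2i} = \sigma^2 \frac{\lambda^{2n}}{\lambda^2 - 1}.$$

The standard deviation $\tilde{\sigma}$ is therefore at most $\lambda^n \frac{\sigma}{\sqrt{\lambda^2 - 1}}$. Pick $l = 2^{\frac{\epsilon_1}{3} n}$ and essentially pretend that this random
variable $\tilde{W}_k$ has bounded support of $l\tilde{\sigma}$ during the encoding process. By comparing (12) to the above, the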
effective Q is simply = 2^ n a x [^. Define tt = (J\f^ so that the effective Vt = 2^ n n. 



Encode the checkpoint increments recursively as before, only add an additional variable-length code
for the value of $\lfloor \frac{\tilde{W}_k}{l\tilde{\sigma}} + \frac{1}{2} \rfloor$ while treating the remainder using the fixed-rate code as before. The
variable-length code is a unary encoding that counts how many $l\tilde{\sigma}$ away from the center the $\tilde{W}_k$ actually is. (Fig. 9
illustrates the unary code.) Let $L_k$ be the length of the $k$-th unary codeword. This is bounded above by

$$P(L_k > 3 + j) = P(|\tilde{W}| > j l \tilde{\sigma}).$$

Let $N$ be a standard Gaussian random variable and rewrite this as

$$P(L_k > 3 + j) = P(|N| > j 2^{\frac{\epsilon_1}{3} n}) \le \exp(-\tfrac{1}{2} j^2 2^{\frac{2\epsilon_1}{3} n}) \qquad (27)$$

and so $L_k$ is very likely indeed to be small and certainly has a finite expectation $\bar{L} < 4$ if $n$ is large.
Fig. 9. Unary encoding of integer offsets to deal with the unbounded support. The first bit denotes start while the next two bits reflect
the sign. The length of the rest reflects the magnitude of the offset with a zero termination. The encoding is prefix-free and hence uniquely
decodable. The length of the encoding of integer $S$ is bounded by $3 + |S|$.
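
One bit layout that is consistent with this caption can be sketched in Python. The exact assignment used in the paper's figure is not recoverable from the text, so the layout here is an assumption: a start bit `1`, a two-bit sign field (`00` for a zero offset, `01` for positive, `10` for negative), then $|S| - 1$ ones and a terminating zero when $S \neq 0$. The stream decoder treats zeros between codewords as idle padding and, unlike the real scheme, assumes no fixed-rate increment bits are interleaved.

```python
def encode_offset(s):
    """Unary-style codeword for an integer offset; length is at most 3 + |s|."""
    if s == 0:
        return "100"  # start bit + sign field '00' meaning "no offset"
    sign = "01" if s > 0 else "10"
    return "1" + sign + "1" * (abs(s) - 1) + "0"


def decode_offset(bits):
    """Decode one codeword from the front of `bits`; return (offset, bits_consumed)."""
    assert bits[0] == "1", "codewords begin with a start bit"
    sign = bits[1:3]
    if sign == "00":
        return 0, 3
    run = 0
    while bits[3 + run] == "1":
        run += 1
    magnitude = run + 1
    return (magnitude if sign == "01" else -magnitude), 3 + run + 1


def decode_stream(bits):
    """Decode back-to-back codewords, skipping zero padding between them."""
    offsets, pos = [], 0
    while pos < len(bits):
        if bits[pos] == "0":  # padding emitted while the queue was empty
            pos += 1
            continue
        s, used = decode_offset(bits[pos:])
        offsets.append(s)
        pos += used
    return offsets
```

Because every codeword starts with a 1, a run of zeros can never be mistaken for the start of a codeword, which is the property the zero-padding argument relies on.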



The fixed-rate part of the checkpoint encoding has a rate that is the same as that given by (13), except
that $\Omega$ is now mildly a function of $n$. Plugging in the effective $\Omega$, which carries the extra factor $2^{\frac{\epsilon_1}{3} n}$, for $\Omega$ in (13) gives

„ . L x , iog2(i + ^)+iog 2 (2+g) io g2 r%v 

Rif > max log, A H , — 

\ n n 

I , log 2 (l + S^) + log2(2 + ^) log 2 [fl 
max iog 2 A H 



n n 

2 ^ log 2 (2"^ + K (feT))+ lQ g2(2 1 ^" + I) log 2 rfl 

max log 2 A + —€\ H , 

inn 

Essentially, the required rate $R_{1,f}$ for the fixed-rate part has only increased by a small constant $\frac{\epsilon_1}{3}$.
Holding $\lambda$ fixed and assuming $n$ is large enough, we can see that

$$R_{1,f} = \log_2 \lambda + \epsilon_1 \qquad (28)$$

is sufficient.

b) Smoothing out the flow: The code so far is variable-rate and to turn this into a fixed-rate
$R_1 = \log_2 \lambda + \epsilon_1 + \epsilon_q$ bitstream, it is smoothed by going through a FIFO queue. First, encode the offset using
the variable-length code and then recursively encode the increment as was done in the finite support case. 
All such codes will begin with a 1 and thus we can use zeros to pad the end of a codeword whenever 
the FIFO is empty. When n is large, the average input rate to the FIFO is smaller than the output rate 
and hence it will be empty infinitely often. 

c) Getting history and encoding it: Section IV explains why such a transformation is possible by
subtracting off a scaled version of the endpoint. The result is a stable Gaussian process and so [9] reveals
that it can be encoded arbitrarily close to its rate-distortion bound $R^{\overleftarrow{X}}(d) = R^{X}(d) - \log_2 \lambda$ if $n$ is large
enough.

d) Decoding the checkpoints: The decoder can wait long enough so that the checkpoint we are 
interested in is very likely to have made it through the FIFO queue by now. The ideas here are similar 
to [7], [35] in that a FIFO queue is used to smooth out the rate variation with good large deviations 
performance. There is $n\epsilon_q$ slack that has to accommodate $L_k$ bits. Because $n$ can be made large, the error
exponent with delay here can be made as large as needed. 
More precisely, a packet of size $n(\epsilon_1 + \log_2 \lambda) + L_k$ bits arrives every $n$ time units where the $L_k$ are iid.
This is drained at rate $R_1 = \epsilon_q + \epsilon_1 + \log_2 \lambda$. An alternative view is therefore that a point packet arrives
deterministically every $n$ time units and it has a random service time $T_k$ given by
$n \frac{\epsilon_1 + \log_2 \lambda}{\epsilon_q + \epsilon_1 + \log_2 \lambda} + \frac{L_k}{\epsilon_q + \epsilon_1 + \log_2 \lambda}$.
Define $(1 - \epsilon') = \frac{\epsilon_1 + \log_2 \lambda}{\epsilon_q + \epsilon_1 + \log_2 \lambda}$. Then the random service time is
$T_k = (1 - \epsilon') n + \frac{L_k}{\epsilon_q + \epsilon_1 + \log_2 \lambda}$ when measured
in time units or $T_k^b = (1 - \epsilon') n R_1 + L_k$ when measured in bit-units.

This can be analyzed using large-deviations techniques or by applying standard results in queuing. The
important thing is a bound on the length $L_k$ which is provided by (27). It is clear that

$$P(L_k > 3 + j) \le \exp(-\tfrac{1}{2} j^2 2^{\frac{2\epsilon_1}{3} n}) \le \exp(-(2^{\frac{2\epsilon_1}{3} n - 1}) j).$$

Since an exponential eventually dominates all constants, we know that for any $\beta > 0$, there exists a
sufficiently large $n$ so that:

$$P(L_k - 3 > j) \le 2^{-\beta j} \qquad (29)$$

Thus, the delay (in bits) experienced by a block in the queue will behave no worse than that of point
messages arriving every $nR_1$ bits where each requires at least $nR_1(1 - \epsilon') + 3 = nR_1(1 - \epsilon'')$ bits plus
an iid geometric $(1 - p)$ number of bits with $p = 2^{-\beta}$.

Lemma 6.3 applies to this queuing problem and the second part of that lemma tells us that the delay
performance is exactly the same as that of a system with point messages arriving every $n\epsilon''$ bits requiring
only an iid geometric number of bits. Since $p$ is small, the first part of Lemma 6.3 applies. Set $r = 1$,
then the bit-delay exponent $\alpha_b$ is at least

$$\alpha_b \ge -\log_2 2^{-\beta} - 2 \cdot 2^{-\beta}$$

which is at least $\beta - 1$ when $n\epsilon'' > 3$. Converting between bit-delay and time-delay is essentially just a
factor of $\log_2 \lambda$ and so the time-delay exponent is at least a fixed positive multiple of $\beta - 1$. But $\beta$ can be made as large as we want
by choosing $n$ large enough.
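
The smoothing queue can be simulated with a Lindley-type backlog recursion in bit units. The Python sketch below is an illustration under stated assumptions rather than the paper's analysis: each block contributes $n(\log_2 \lambda + \epsilon_1)$ fixed bits plus $L_k = 3 + G_k$ bits, where $G_k$ is drawn as a geometric number of extra bits with parameter `p` (a stand-in for the tail bound (29)), and the queue drains $nR_1$ bits between arrivals. Function and parameter names are hypothetical.

```python
import math
import random


def smoothing_delays(lam, eps1, epsq, n, p, num_blocks, seed=0):
    """Per-block delay (in time units) through a FIFO drained at rate R1.

    The delay of a block is the time from its arrival until its last bit
    leaves the queue.
    """
    rng = random.Random(seed)
    R1 = math.log2(lam) + eps1 + epsq         # drain rate in bits per time unit
    fixed_bits = n * (math.log2(lam) + eps1)  # fixed-rate part of each block
    backlog = 0.0
    delays = []
    for _ in range(num_blocks):
        extra = 0
        while rng.random() < p:               # geometric number of extra unary bits
            extra += 1
        L_k = 3 + extra
        # Drain n*R1 bits since the previous arrival, then add the new block.
        backlog = max(backlog - n * R1, 0.0) + fixed_bits + L_k
        delays.append(backlog / R1)
    return delays
```

When `p = 0` and $n\epsilon_q > 3$, every block clears before the next one arrives, so all delays are equal and shorter than $n$; increasing `p` lets one see how occasional long unary codewords produce the rare large delays that the large-deviations bound controls.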

e) Getting the final reconstruction: The history process is added to the recovered checkpoint. This 
differs from the original process by only the error in the history plus the impact of the error in the 
checkpoint. The checkpoint reconstruction-error's impact dies exponentially since the history process is 
stable. So the target distortion is achieved if the checkpoint has arrived by the time reconstruction is 
attempted. By choosing a large enough end-to-end delay $\phi$, the probability of this can be made as high
as we like.

However, the goal is not just to meet the target distortion level d with high probability, it is also to hit 
the target in expectation. Thus, we must bound the impact of not having the checkpoint available in time. 
When this happens, the un-interpretable history information is ignored and the most recent checkpoint is 
simply extrapolated forward to the current time. The expected squared errors grow as $\lambda^{2\psi}$ where $\psi$ is the
delay in time-units. The arguments here exactly parallel those of Theorem 2.2 where the FIFO queue is
acting like an anytime code. Since the delay-exponent of the queue is as large as we want, it can be made
larger than $2 \log_2 \lambda$. Thus, the expected distortion coming from such "overflow" situations is as small as
desired. This completes the proof of Corollary 2.1. □



1 While this proof is written for the Gaussian case, the arguments here readily generalize to any driving distribution W that has at least
an exponential tail probability. To accommodate W with power-law tail distributions would require the use of logarithmic encodings as 
described in [36], [37]. This does not work for our case because the unary nature of the encoding is important when we consider transporting 
such bitstreams across a noisy channel. 
B. Channel sufficiency for communicating Gaussian processes

This section shows why Corollary 2.2 is true. The story in the Gaussian case is mostly unchanged since
the historical information is as classical as ever. The only issue is with the checkpoint stream. An error
in a bit $\psi$ steps ago can do more than propagate through the usual pathway. It could also damage the bits
corresponding to the variable-length offset.

Because of the unary encoding and the $2^{\frac{\epsilon_1}{3} n}$ expansion in the effective $\Omega$, an uncorrected bit-stream
error $\psi$ time-steps ago can only impact the error in the checkpoint reconstruction by $4\psi(\log_2 \lambda) 2^{\frac{\epsilon_1}{3} n}$ since
the worst error is clearly to flip the sign bit and keep the unary codeword from terminating, thereby making
it at most $2\psi \log_2 \lambda$ bits long. The current reconstruction is therefore incorrect by an $O(\psi 2^{\frac{\epsilon_1}{3} n} \lambda^{\psi})$ change
in its value. As far as $\eta$-distortion is concerned, the distortion grows by a factor $O(\psi^{\eta} 2^{\eta \frac{\epsilon_1}{3} n} \lambda^{\eta \psi})$ from what
it would be with correct reconstruction. Asymptotically, the delay $\psi$ is much larger than the block-length
$n$ and so the polynomial term in front is insignificant relative to the exponential in $\psi$. If the code has
anytime reliability $\alpha > \eta \log_2 \lambda$, then the same argument as Theorem 2.2 applies and the Corollary holds.
□

VIII. Extensions to the vector case 

With the scalar case explored, it is natural to consider what happens for general finite-dimensional 
linear models where $\lambda$ is replaced with a matrix $A$ and $X$ is a vector. In the Gaussian process case, these
will correspond to cases with formally rational power-spectral densities. Though the details are left to the 
reader, the story is sketched here. No fundamentally new phenomena arise in the vector case, except that 
different anytime reliabilities can be required on different streams arising from the same source as is seen 
in the control context [6]. 

The source-coding results here naturally extend to the fully observed vector case with generic driving
noise distributions. Instead of two message streams, there is one special stream for each unstable eigenvalue
$\lambda_i$ of $A$ and a single final stream capturing the residual information across all dimensions. All the sufficiency
results also generalize in a straightforward manner: each of the unstable streams requires a corresponding
anytime reliability depending on the distortion function's $\eta$ and the magnitude of the eigenvalue. The
multiple priority-stream necessity results also follow generically.² This is a straightforward application
of a system diagonalization³ argument followed by an eigenvalue by eigenvalue analysis. The necessity
result for the residual rate follows the same proof as here based on inverse-conditional rate-distortion with
the endpoints in all dimensions used as side-information.

The case of partially observed vector Markov processes where the observations $C_y X$ are linear in the
system state requires one more trick. We need to invoke the observability⁴ of the system state. Instead
of a single checkpoint pair, use an appropriate number⁵ of consecutive values for the observation and
encode them together to high fidelity $\Delta$. This can be done by transforming coordinates linearly so that the
system is diagonal, though driven by correlated noise, from checkpoint-block to the next checkpoint-block.
The initial condition is governed by the self-noise that is unavoidable while trying to observe the state.
Each unstable eigenvalue will contribute its own $\log_2 \lambda_i$ term to the first stream rate and will require the
appropriate anytime reliability. The overhead continues to be sublinear in $n$ and the residual information
continues to be classical in nature by the same arguments given here.

2 The required condition is that the driving noise distribution W should not have support isolated to an invariant subspace of A. If that
were to happen, there would be modes of the process that are never excited. 

3 The case of non-diagonal Jordan blocks is only a challenge for the necessity part regarding anytime reliability. It is covered in [6] in the 
control context. The same argument holds here with a Riemann-integrable joint-density assumption on the driving noise. 

4 The linear observation should not be restricted to a single invariant subspace. If it were, we could drop the other subspaces from the 
model as irrelevant to the observed process under consideration. 

5 The appropriate number is twice the number of observations required before all of the unstable subspaces show up in the observation. 
This number is bounded above by twice the dimensionality of the vector state space. The factor of two is to allow each block to have its 
own beginning and end. 
The partially observed necessity story is essentially unchanged on the information embedding side,
except that every long block should be followed by a miniblock of length equal to the dimensionality 
k during which no message is embedded and only common-randomness is used to generate the driving 
noise. This will allow the decoder to easily use observability to get noisy access to the unstable state 
itself. 

In [6], these techniques are applied in the context of control rather than estimation. The interested 
reader is referred there for the details. Some simplifications to the general story might be possible in the 
case of SISO autoregressive processes, but we have not explored them in detail. 



IX. Conclusions

We have characterized the nature of information in an unstable Markov process. On the source coding
side, this was done by giving the fixed-rate coding Theorem 2.1. This theorem's code construction naturally
produces two streams: one that captures the essential unstable nature of the process and requires a rate
of at least $\log_2 \lambda$, and another that captures the essentially classical nature of the information left over. The
quantitative distortion is dominated by the encoding of the second stream, while the first stream serves
to ensure its finiteness as time goes on. The essentially stable nature of the second stream's information
is then made precise by Theorem 2.4, which relates the forward D(R) curve to the "backwards" one
corresponding to a stable process.

At the intersection of source and channel coding, the notion of anytime reliability was reviewed and
Theorem 5.1 shows that it is nonzero for DMCs at rates below capacity. Theorem 2.2 and Lemma 6.2
then show that the first stream requires a high-enough anytime reliability from a communication system
rather than merely enough rate. In contrast, Theorems 2.3 and 2.5 show that the second stream requires
only sufficient rate. Together, all these results establish the relevant separation principle for such unstable
Markov processes.

This work brings exponentially unstable processes firmly into the fold of information theory. More
fundamentally, it shows that reliability functions are not a matter purely internal to channel coding. In the
case of unstable processes, the demand for appropriate reliability arises at the source-channel interface.
Thus unstable processes have the potential to be useful models while taking an information-theoretic look
at QoS issues in communication systems. The success of the "reductions and equivalences" paradigm of
[5], [19] here suggests that this approach might also be useful in understanding other situations in which
classical approaches to separation theorems break down.

Appendix I

Riemann-integrable densities as mixtures

It is often conceptually useful to think of generic random variables with Riemann-integrable densities
as being mixtures of a blurred uniform random variable along with something else. This appendix proves
Lemma 4.1.

Since the density is Riemann-integrable, the lower Riemann sums over cells of width $\delta$ converge to the
integral of the density:

$$\lim_{\delta \to 0} \sum_{i=-\infty}^{\infty} \delta \inf_{x \in [i\delta, (i+1)\delta)} f_W(x) = \int_{-\infty}^{\infty} f_W(x)\,dx = 1.$$

Thus, $f_W$ can be expressed as a non-negative piecewise constant function $f'_W$ that only changes every
$\delta$ units plus a non-negative function $f''_W$ representing the "error" in Riemann-integration from below. By
choosing $\delta$ small enough, the total mass in $f''_W$ can be made as small as desired since the Riemann sums
above converge.

Choose $\delta$ such that the total mass in $f''_W$ is $\gamma$. So

$$f_W = (1 - \gamma)\Big(\frac{f'_W}{1 - \gamma}\Big) + \gamma\Big(\frac{f''_W}{\gamma}\Big) \qquad (30)$$
and thus $W$ can be simulated in the following way: 

1) Flip an independent biased coin $C_\gamma$ with probability of heads $\gamma$. 

2) If heads, independently draw the value of $W$ from the density $\frac{f''_W}{\gamma}$ corresponding to a random 
variable $W''$. 

3) If tails, independently draw the value of $W$ from the random variable $W'$ with piecewise constant 
density $\frac{f'_W}{1-\gamma}$. This can clearly be done by using a discrete random variable $G_\delta$ plus an 
independent uniform random variable $U_\delta$ so that $W' = G_\delta + U_\delta$ has density $\frac{f'_W}{1-\gamma}$. 

This proves the result. □ 
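
The three simulation steps above can be illustrated with a small sketch. It assumes a toy density $f_W(x) = 2x$ on $[0,1]$ and a grid of $n$ bins of width $\delta = 1/n$; both the density and the function names `lower_step_masses` and `sample_w` are choices made for this example only and do not come from the paper. For this density the lower step function has mass $1-\delta$, so the coin has bias $\gamma = \delta$.

```python
import random

def lower_step_masses(n):
    """Mass of the piecewise-constant lower function f' on each of n bins.

    Toy assumption: f_W(x) = 2x on [0, 1]. On bin i = [i*delta, (i+1)*delta)
    the largest constant below f_W is 2*i*delta, contributing mass 2*i*delta**2.
    """
    delta = 1.0 / n
    return [2.0 * i * delta * delta for i in range(n)]

def sample_w(n, rng):
    """Draw one sample of W via the coin-flip mixture (n >= 2 bins)."""
    delta = 1.0 / n
    masses = lower_step_masses(n)
    gamma = 1.0 - sum(masses)  # mass of the error part f''; equals delta here
    if rng.random() < gamma:
        # Heads: draw from f''/gamma. Every bin carries the same error mass and,
        # inside a bin, the error density grows linearly (hence the square root).
        i = rng.randrange(n)
        return i * delta + delta * rng.random() ** 0.5
    # Tails: W' = G_delta + U_delta, with G_delta discrete and U_delta uniform.
    g = rng.choices(range(n), weights=masses)[0] * delta
    u = rng.random() * delta
    return g + u
```

Averaging many calls such as `sample_w(10, random.Random(0))` should give a sample mean close to $E[W] = 2/3$, which is the mean of the toy density itself.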

Appendix II 

Entropy bound for quantized random variables with bounded moments 

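Lemma 2.1: Consider a random variable $Z$ that is quantized to precision $\Delta$ so $Z^q = Q_\Delta(Z)$. Further 
suppose that $E[|Z|^\eta] \le K$ where $K > \Delta^\eta$. Then 

$$H(Z^q) \le 7 + \frac{\log_2 K}{\eta} + 2\log_2\frac{\log_2 K}{\eta} + \log_2\frac{1}{\Delta} + 2\log_2\log_2\frac{1}{\Delta} + \frac{5+\ln 2}{\eta \ln 2}. \qquad (31)$$

Proof: Let $Z^q = S\Delta$ where $S$ is an integer. Then $|S| \le 1 + \frac{|Z|}{\Delta}$ and so 
$E[|S|^\eta] \le 2^\eta\left(1 + \frac{K}{\Delta^\eta}\right) \le \frac{2^{\eta+1}K}{\Delta^\eta}$, where the last step uses $K > \Delta^\eta$. 
Applying the Markov inequality gives 

$$P(|S| \ge s) \le \min\left(1, \frac{2^{\eta+1}K}{\Delta^\eta}\, s^{-\eta}\right). \qquad (32)$$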

The integer $S \ne 0$ can be encoded into bits using a self-punctuated code using less than 
$4 + \log_2(|S|) + 2\log_2(1 + \log_2(|S|+1))$ bits [38]. First encode the sign of $S$ using a single bit. There 
are at most $1 + \log_2(|S|+1)$ digits in the natural binary expansion of $|S|$. This length can be encoded 
using at most $2 + 2\log_2(1 + \log_2(|S|+1))$ bits by giving its binary expansion with each digit followed 
by a 0 if it is not the last digit, and a 1 if it is the last digit. Finally, $|S|$ itself can be encoded using at 
most $1 + \log_2|S|$ bits. 
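
A minimal sketch of such a self-punctuated code follows, to make the three parts (sign, flagged length, magnitude) concrete. The function names and the convention that a sign bit of 1 means negative are assumptions made for this illustration; the construction in [38] may differ in detail.

```python
import math

def encode_nonzero_int(s):
    """Encode a nonzero integer as a self-punctuating string of '0'/'1' characters."""
    if s == 0:
        raise ValueError("this sketch only handles S != 0")
    sign_bit = "1" if s < 0 else "0"
    magnitude = format(abs(s), "b")              # natural binary expansion of |S|
    length_digits = format(len(magnitude), "b")  # binary expansion of its length
    # Each length digit is followed by 0 (more digits follow) or 1 (last digit).
    flags = "0" * (len(length_digits) - 1) + "1"
    length_part = "".join(d + f for d, f in zip(length_digits, flags))
    return sign_bit + length_part + magnitude

def decode_nonzero_int(bits):
    """Decode one integer from the front of `bits`; return (value, bits_consumed)."""
    negative = bits[0] == "1"
    pos, length_digits = 1, ""
    while True:
        length_digits += bits[pos]
        is_last = bits[pos + 1] == "1"
        pos += 2
        if is_last:
            break
    n = int(length_digits, 2)
    magnitude = int(bits[pos:pos + n], 2)
    return (-magnitude if negative else magnitude), pos + n

def length_bound(s):
    """The bound 4 + log2|S| + 2*log2(1 + log2(|S| + 1)) quoted in the text."""
    a = abs(s)
    return 4 + math.log2(a) + 2 * math.log2(1 + math.log2(a + 1))
```

For example, `encode_nonzero_int(5)` gives `"01011101"`: sign bit `0`, the length 3 written as `11` with flags (`1011`), then the magnitude `101`. Because the decoder knows where the codeword ends, codewords can be concatenated without separators.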

Since the entropy must be less than the expected code-length for any code, 

$$\begin{aligned}
H(S) &\le 4 + E[\log_2(|S|)] + 2E[\log_2(1 + \log_2(|S|+1))] \\
&= 4 + \int_0^\infty P(\log_2(|S|) > l)\,dl + 2\int_0^\infty P(\log_2(1 + \log_2(|S|+1)) > l)\,dl.
\end{aligned}$$
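First, we deal with the dominant term 

$$\begin{aligned}
\int_0^\infty P(\log_2(|S|) > l)\,dl &= \int_0^\infty P(|S| > 2^l)\,dl \\
&\le \int_0^\infty \min\left(1, \frac{2^{\eta+1}K}{\Delta^\eta}\,2^{-\eta l}\right)dl \\
&= \frac{1}{\eta}\log_2\left(\frac{2^{\eta+1}K}{\Delta^\eta}\right) + \int_0^\infty 2^{-\eta l}\,dl \\
&= 1 + \frac{\log_2 K}{\eta} + \log_2\frac{1}{\Delta} + \frac{1+\ln 2}{\eta\ln 2}.
\end{aligned}$$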
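
As a sanity check on the dominant term, the sketch below compares a numerical evaluation of $\int_0^\infty \min(1, c\,2^{-\eta l})\,dl$ with the closed form $\frac{1}{\eta}\log_2 c + \frac{1}{\eta \ln 2}$, taking $c = 2^{\eta+1}K/\Delta^\eta$. The function names and the sample values of $\eta$, $K$ and $\Delta$ are assumptions chosen for illustration.

```python
import math

def closed_form(c, eta):
    """(1/eta)*log2(c) + 1/(eta*ln 2): the integral of min(1, c*2^(-eta*l)) over l >= 0, for c >= 1."""
    return math.log2(c) / eta + 1.0 / (eta * math.log(2))

def dominant_term_bound(k, delta, eta):
    """The same quantity written as 1 + log2(K)/eta + log2(1/Delta) + (1 + ln 2)/(eta ln 2)."""
    return (1.0 + math.log2(k) / eta + math.log2(1.0 / delta)
            + (1.0 + math.log(2)) / (eta * math.log(2)))

def numeric_integral(c, eta, upper=200.0, steps=200000):
    """Midpoint-rule approximation of the integral, truncated at `upper`."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        l = (k + 0.5) * h
        total += min(1.0, c * 2.0 ** (-eta * l))
    return total * h
```

With $\eta = 2$, $K = 4$ and $\Delta = 0.5$ the constant is $c = 128$, and the two closed-form expressions agree with each other and with the numerical integral up to discretization error.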



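Next, consider the smaller term 

$$\begin{aligned}
2\int_0^\infty P(\log_2(1 + \log_2(|S|+1)) > l)\,dl
&= 2\int_0^\infty P(\log_2(|S|+1) > 2^l - 1)\,dl \\
&= 2\int_0^\infty P\left(|S| + 1 > \frac{2^{2^l}}{2}\right)dl \\
&\le 2\left(1 + \int_0^\infty P(|S| > 2^{2^l})\,dl\right) \\
&\le 2 + 2\int_0^\infty \min\left(1, \frac{2^{\eta+1}K}{\Delta^\eta}\,2^{-\eta 2^l}\right)dl \\
&= 2 + 2\log_2\left(\frac{1}{\eta}\log_2\frac{2^{\eta+1}K}{\Delta^\eta}\right) + 2\int_{\log_2\left(\frac{1}{\eta}\log_2\frac{2^{\eta+1}K}{\Delta^\eta}\right)}^{\infty} \frac{2^{\eta+1}K}{\Delta^\eta}\,2^{-\eta 2^l}\,dl \\
&\le 2 + 2\log_2\frac{\log_2 K}{\eta} + 2\log_2\left(1 + \frac{1}{\eta}\right) + 2\log_2\log_2\frac{1}{\Delta} + \frac{2}{\eta\ln 2} \\
&\le 2 + 2\log_2\frac{\log_2 K}{\eta} + 2\log_2\log_2\frac{1}{\Delta} + \frac{4}{\eta\ln 2}
\end{aligned}$$

where the final inequalities come from the concave (∩) nature of $\log_2$ and lower bounding $2^l$ with just $l$. 
Putting everything together gives the desired result. □ 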

Appendix III 
Proof of Proposition 3.2 

From the earlier results, we know for every $\epsilon_2 > 0$, if $\Delta$ is small enough and $n$ is large enough, that there 
exists a random vector $Y_0^{n-1}$ so that $\frac{1}{n}\sum_{i=0}^{n-1}\rho(X_i, Y_i) = d + \epsilon_3$ and that even the best such vector must 
satisfy 

$$n(R_\infty^X(d) - \epsilon_2) \le I(X_0^{n-1}; Y_0^{n-1}) \le n(R_\infty^X(d) + \epsilon_2).$$

Decompose the relevant mutual information as 

$$I(X_0^{n-1}; Y_0^{n-1} \mid Z^q, \Theta) = -I(X_0^{n-1}; Z^q \mid \Theta) + I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta). \qquad (33)$$

To get the desired result of asymptotic equality, this conditional mutual information has to be both upper 
and lower bounded. To upper bound the conditional mutual information, we lower bound $I(X_0^{n-1}; Z^q \mid \Theta)$ 
and upper bound $I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta)$. Vice-versa to get the lower bound. 
A. Lower bounding $I(X_0^{n-1}; Z^q \mid \Theta)$ 

The first term is easily lower bounded for $\Delta$ small enough since 

$$\begin{aligned}
I(X_0^{n-1}; Z^q \mid \Theta) &= H(Z^q \mid \Theta) - H(Z^q \mid X_0^{n-1}, \Theta) \\
&= H(Z^q \mid \Theta) \\
&\ge H(Z^q \mid \Theta, W_0^{n-2}) \\
&\ge \lfloor \log_2 \lambda^{n-1} \rfloor \\
&= \lfloor (n-1)\log_2\lambda \rfloor. \qquad (34)
\end{aligned}$$

This holds since conditioned on the final dither $\Theta$, the quantized endpoint is a discrete random variable 
that is a deterministic function of $X_{n-1}$ and conditioning reduces entropy. But $Z^q$ conditioned on the 
driving noise $W_0^{n-2}$ is just the $\Delta$-precision quantization of $\lambda^{n-1}$ times a uniform random variable of 
width $\Delta$ and hence has discrete entropy $\ge \log_2 \lambda^{n-1}$. 
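
The entropy claim for a quantized uniform random variable can be checked numerically. The sketch below rescales so that $\Delta = 1$ and computes the exact entropy of the bin index of a uniform random variable of width $\lambda^{n-1}$ starting at an arbitrary offset; the function name and the particular values of $\lambda$ and the offsets are assumptions made for illustration. Since no bin can hold probability more than $1/\lambda^{n-1}$, the entropy is at least $\log_2\lambda^{n-1}$.

```python
import math

def quantized_uniform_entropy(width_in_bins, offset):
    """Entropy in bits of floor(U), where U is uniform on [offset, offset + width_in_bins)."""
    lo, hi = offset, offset + width_in_bins
    entropy = 0.0
    for b in range(math.floor(lo), math.ceil(hi)):
        overlap = min(hi, b + 1) - max(lo, b)  # length of unit bin b covered by U
        if overlap > 0:
            p = overlap / width_in_bins
            entropy -= p * math.log2(p)
    return entropy
```

When the width is an integer and the offset is aligned, for instance `quantized_uniform_entropy(4.0, 0.0)`, every bin is equally likely and the entropy is exactly 2 bits; misaligned offsets only spread the mass over more bins.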



B. Upper bounding $I(X_0^{n-1}; Z^q \mid \Theta)$ 

To upper-bound this term, Lemma 2.1 can be used to see 

$$I(X_0^{n-1}; Z^q \mid \Theta) = H(Z^q \mid \Theta) \le 7 + \frac{\log_2 K'}{\eta} + 2\log_2\frac{\log_2 K'}{\eta} + \log_2\frac{1}{\Delta} + 2\log_2\log_2\frac{1}{\Delta} + \frac{5+\ln 2}{\eta\ln 2}$$

where $K'$ is an upper bound on $E[|Z|^\eta]$. Such an upper bound is readily available since 



E[\Z\'<] = E[\X n \v] 

= ^ 1) i?[(^ T + |^n-lA-W J |)''] 

< A^^maxC^J^n-lA-W, 

i=0 

< Ar ' (n - 1) (^iy + ™). 
Using this for K' and taking logs shows 

log 2 K' log 2 (A^ 1 )(^ + 2^)) 



7] 7] 

= l + (n-l)log 2 A + 



log 2 (^ + VK) 



V 

Substituting this in gives the desired bound 

I(X^-Z q \Q) 

< 8 + (n - 1) log 2 A + 2 log 2 (l + (n - 1) log 2 A + log ^ aCT + 2?? ^ ) } + { 1 + 2 { { 1 + 5 + hg } 

rj A A 77 In 2 

There is only a single $O(n)$ term above, and it is $(n-1)\log_2\lambda$. Everything else is $o(n)$. 
C. Lower bounding $I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta)$ 

We need to establish 

$$I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta) \ge n(R_\infty^X(d) - \epsilon_2). \qquad (37)$$

This is immediately obvious from 

$$\begin{aligned}
I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta) &= I(X_0^{n-1}; Y_0^{n-1} \mid \Theta) + H(Z^q \mid \Theta, Y_0^{n-1}) - H(Z^q \mid \Theta, Y_0^{n-1}, X_0^{n-1}) \\
&= I(X_0^{n-1}; Y_0^{n-1} \mid \Theta) + H(Z^q \mid \Theta, Y_0^{n-1}) \\
&\ge n(R_\infty^X(d) - \epsilon_2).
\end{aligned}$$

The first equality is just expanding the mutual information and recognizing the fact that $Z^q$ is discrete once 
conditioned on the dither and so $H$ is the regular discrete entropy here. Let $Q_{(\Delta,\Theta)}$ denote the dithered 
scalar quantizer used to generate the encoded checkpoints, just appropriately translated so it can apply to 
the $X$ giving $Z^q = Q_{(\Delta,\Theta)}(X_{n-1})$. The next equality is a consequence of this deterministic relationship. 
Finally, the discrete entropy is always positive and can be dropped to give a lower bound. 

D. Upper bounding $I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta)$ 

The second term of (33) is upper bounded in a way similar to the first term. We need to establish 

$$I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta) \le n(R_\infty^X(d) + \epsilon_2) + o(n). \qquad (38)$$

Expand the mutual information as before 

$$\begin{aligned}
I(X_0^{n-1}; Y_0^{n-1}, Z^q \mid \Theta) &= I(X_0^{n-1}; Y_0^{n-1} \mid \Theta) + H(Z^q \mid \Theta, Y_0^{n-1}) \\
&\le I(X_0^{n-1}; Y_0^{n-1} \mid \Theta) + H(Z^q \mid \Theta, Y_{n-1}) \\
&\le n(R_\infty^X(d) + \epsilon_2) + H(Z^q - Q_{(\Delta,\Theta)}(Y_{n-1}) \mid \Theta, Y_{n-1}) \\
&\le n(R_\infty^X(d) + \epsilon_2) + \log_2 3 + H(Q_{(\Delta,\Theta)}(X_{n-1} - Y_{n-1}) \mid \Theta).
\end{aligned}$$

The first inequality comes from dropping conditioning. After that, the quantizer $Q_{(\Delta,\Theta)}$ can be applied 
to $Y_{n-1}$ so that $Z^q - Q_{(\Delta,\Theta)}(Y_{n-1}) = S\Delta$ where $S$ is an integer-valued random variable representing 
how many steps up or down the $\Delta$-quantization ladder are needed to get from $Q_{(\Delta,\Theta)}(Y_{n-1})$ to $Z^q$. The 
difference of two quantized numbers differs by at most 1 quantization bin from the quantization of the 
difference. This slack of up to 1 bin in either direction can be encoded using $\log_2 3$ bits. 
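
The one-bin slack can be illustrated with a short sketch. The quantizer below, which returns the index of the $\Delta$-bin containing $x + \theta$, is a hypothetical stand-in for the paper's dithered quantizer, and the function names are assumptions for this example. For any such floor-based quantizer with dither $\theta \in [0, \Delta)$, the difference of two quantized values is within one bin of the quantized difference, so the slack takes one of three values.

```python
import math
import random

def dithered_quantize(x, delta, theta):
    """Hypothetical dithered scalar quantizer: index of the delta-bin containing x + theta."""
    return math.floor((x + theta) / delta)

def slack(x, y, delta, theta):
    """Number of bins separating Q(x) - Q(y) from Q(x - y)."""
    return (dithered_quantize(x, delta, theta)
            - dithered_quantize(y, delta, theta)
            - dithered_quantize(x - y, delta, theta))
```

Sampling `slack` over random inputs should only ever produce values in `{-1, 0, 1}`, which is why $\log_2 3$ bits suffice to describe it.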

At this point, Lemma 2.1 applies using the trivial upper bound $n(d + \epsilon_3)$ for the $\eta$-th moment of 
$X_{n-1} - Y_{n-1}$, since the worst case is for the entire distortion to fall on the last component of the vector. 

$$H(Q_{(\Delta,\Theta)}(X_{n-1} - Y_{n-1}) \mid \Theta) \le 7 + \frac{\log_2 n(d+\epsilon_3)}{\eta} + 2\log_2\frac{\log_2 n(d+\epsilon_3)}{\eta} + \log_2\frac{1}{\Delta} + 2\log_2\log_2\frac{1}{\Delta} + \frac{5+\ln 2}{\eta\ln 2}$$

The $\log_2 n$ term is certainly $o(n)$. The only other term that might raise concern is $\log_2\frac{1}{\Delta}$, but that is 
$o(n)$ since the earlier analysis of the first stream already requires us to choose $n$ much larger than that to have $R_1$ close 
to $\log_2\lambda$. The order of limits is to always let $n$ go to infinity before $\Delta$ goes to zero. 

E. Putting pieces together 

With (38) established, it can be applied along with (34) to (33) and gives 

$$I(X_0^{n-1}; Y_0^{n-1} \mid Z^q, \Theta) \le n(R_\infty^X(d) - \log_2\lambda + \epsilon_2) + o(n). \qquad (39)$$

Dividing through by $n$ and taking $n$ to $\infty$ establishes the desired result on the upper bound. 
Similarly putting together (36) and (37) gives 

$$I(X_0^{n-1}; Y_0^{n-1} \mid Z^q, \Theta) \ge n(R_\infty^X(d) - \log_2\lambda - \epsilon_2) - o(n). \qquad (40)$$

Dividing through by $n$ and taking $n$ to $\infty$ establishes the desired result on the lower bound. 

But $\epsilon_2$ was arbitrary and this establishes the desired result. □ 
Appendix IV 
Proof of Theorem 5.1 

Interpret the random ensemble of infinite tree codes as a single code with both encoder and decoder 
having access to the common randomness used to generate the code tree. Populate the tree with iid channel 
inputs drawn from the distribution that achieves $E_r(R)$ for block codes. Theorem 7 in [26] tells us that 
the code achieves anytime reliability $\alpha = E_r(R)$ since the analysis uses the same infinite ensemble for 
all $i$ and delays. 

Alternatively, this can be seen from first principles for ML decoding by observing that any false path 
$\hat{B}_1^i$ can be divided into a true prefix $B_1^{j-1}$ and a false suffix $\hat{B}_j^i$. The iid nature of the channel inputs 
on the code tree tells us that the true code-suffix corresponding to the received channel outputs from 
time $\lceil \frac{j}{R} \rceil$ to $t$ is independent of any false code-suffix. Since there are $\le 2^{R(t - \lceil \frac{j}{R} \rceil)}$ such false code-suffixes 
(ignoring integer effects) at depth $j$, Gallager's random block-coding analysis from [8] applies since all 
that it requires is pairwise independence between true and false codewords. 

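$$\begin{aligned}
P(\hat{B}_j(t) \ne B_j \mid B_1^{j-1} \text{ already known})
&\le P\left(\text{error on random code with } 2^{R(t - \lceil \frac{j}{R} \rceil)} \text{ words and block length } t - \lceil \tfrac{j}{R} \rceil\right) \\
&\le 2^{-(t - \frac{j}{R} - 1)E_r(R)}.
\end{aligned}$$

The probability of error on $B_1^i$ can be bounded by the union bound over $j = 1 \ldots i$. 

$$\begin{aligned}
P(\hat{B}_1^i(t) \ne B_1^i) &\le \sum_{j=1}^{i} P(\hat{B}_j(t) \ne B_j \mid B_1^{j-1} \text{ already known}) \\
&\le \sum_{j=1}^{i} 2^{-(t - \frac{j}{R} - 1)E_r(R)} \\
&< \sum_{j=0}^{\infty} 2^{-(t - \frac{i-j}{R} - 1)E_r(R)} \\
&= K\, 2^{-(t - \frac{i}{R})E_r(R)}.
\end{aligned}$$

The exponent for the probability of error is dominated by the shortest codeword length in the union bound, 
and this corresponds to $t - \lceil \frac{i}{R} \rceil$. □ 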
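
The union bound above is a geometric series, and a short sketch makes the constant explicit. It assumes per-bit error terms of the form $2^{-(t - j/R - 1)E_r(R)}$ and sums the series to get $K = 2^{E_r(R)} / (1 - 2^{-E_r(R)/R})$; this expression for $K$ is derived here for illustration and is not quoted from the paper, and the function names and numerical values are likewise assumptions.

```python
def union_bound_sum(t, i, rate, exponent):
    """Sum over j = 1..i of 2^(-(t - j/rate - 1) * exponent)."""
    return sum(2.0 ** (-(t - j / rate - 1) * exponent) for j in range(1, i + 1))

def geometric_constant(rate, exponent):
    """K = 2^E / (1 - 2^(-E/R)), from summing the geometric series (derived here, not quoted)."""
    return 2.0 ** exponent / (1.0 - 2.0 ** (-exponent / rate))

def anytime_bound(t, i, rate, exponent):
    """K * 2^(-(t - i/R) * E): the delay-exponential form of the bound."""
    return geometric_constant(rate, exponent) * 2.0 ** (-(t - i / rate) * exponent)
```

For any finite $i$, `union_bound_sum` stays strictly below `anytime_bound`, and the largest term in the sum is the one with $j = i$, the shortest effective block length.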

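Appendix V 
Proof of Property 6.1 

$$\begin{aligned}
|X'_{kn} - \hat{X}'_{kn}| &\ge \lambda^{n(k-j)}\beta\left(|M_j - \hat{M}_j| - \sum_{i=j+1}^{\infty} \lambda^{-n(i-j)}|M_i - \hat{M}_i|\right) \\
&\ge \lambda^{n(k-j)}\beta\left(|M_j - \hat{M}_j| - 2\lambda^{-n}\sum_{i=0}^{\infty}\lambda^{-ni}\right) \\
&\ge \lambda^{n(k-j)}\beta\left(2^{1-nR} - 2\,\frac{\lambda^{-n}}{1-\lambda^{-n}}\right) \\
&= \lambda^{n(k-j)}\,2\beta\left(2^{-nR} - \frac{1}{\lambda^n - 1}\right)
\end{aligned}$$

which is positive as long as $2^{-nR} > \frac{1}{\lambda^n - 1}$ or $nR < \log_2(\lambda^n - 1)$. We can thus use 
$K = 2\beta\left(2^{-nR} - \frac{1}{\lambda^n - 1}\right)$ and the property is proved. □ 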
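
The positivity condition can be checked numerically. The sketch below assumes the constant takes the form $K = 2\beta(2^{-nR} - \frac{1}{\lambda^n - 1})$, the factor multiplying $\lambda^{n(k-j)}$ in the last step of the chain, and compares its sign with the condition $nR < \log_2(\lambda^n - 1)$. The function names and numerical values are assumptions for illustration.

```python
import math

def separation_constant(lam, n, rate, beta):
    """K = 2*beta*(2^(-n*rate) - 1/(lam^n - 1)), the factor multiplying lam^(n(k-j))."""
    return 2.0 * beta * (2.0 ** (-n * rate) - 1.0 / (lam ** n - 1.0))

def is_rate_admissible(lam, n, rate):
    """Condition n*rate < log2(lam^n - 1), under which the constant is positive."""
    return n * rate < math.log2(lam ** n - 1.0)
```

With $\lambda = 2$ and $n = 4$, a rate of 0.9 satisfies the condition and gives a positive constant, while a rate of 0.99 violates it and the constant turns negative. As $n$ grows, the admissible rates approach $\log_2\lambda$.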